
Chapter 1

The BEAR Assessment System:


Overview of the "4 Building Blocks" approach

1.0 Book Overview


This chapter serves as an overview of the measurement approach that forms the
basis for this book. It begins with a broad description of what is meant by
“measurement”, and describes a process called construct modeling for the measurement
of psychosocial variables. Generally, I will call the method of gathering information
about a variable the measurement instrument, but in different contexts, this may be a
psychological scale, an achievement test, a behavior checklist, or a survey. The remainder
of the chapter then outlines a specific instrument development framework based on this understanding, which I call the BEAR Assessment System (BAS). This chapter
summarizes all four “building blocks” that guide instrument development and uses a
specific example in educational measurement to illustrate the points being made. The
four following chapters (i.e., Chapters 2-5) each describe one of the building blocks in
detail: the construct map, the items design, the outcome space, and the calibration model.
Seven specific example applications, which I call “exemplars,” are used in the following
chapters to illustrate how each building block operates (see Appendix A: The
Exemplars). The function of each block is mirrored in an online instrument development
environment called the BEAR Assessment System Software (BASS), and its application
to the exemplars can be found at the website noted in Appendix B of the book: The
BASS instrument development and analysis environment.

Having developed a measuring instrument, one must then engage in quality control of the measurements, and this is the focus of the next three chapters. Chapter 6
describes processes of estimating and evaluating the calibration model. Together,
Chapters 7 and 8 describe ways to examine the trustworthiness of the instrument and its
products, focusing on precision (in Chapter 7) and validity and fairness (in Chapter 8).
The many different analyses conducted in these chapters are also carried out in the BASS
environment and may be accessed at the BASS website. The final chapter invites the
reader to look beyond each of the 4 building blocks to begin their own exploration of
measurement, both as an application in the topics they find interesting, and as a topic in
its own right.

Key Concepts: BEAR Assessment System (BAS), construct modeling, the “4 building
blocks,” construct map, items design, outcome space, and measurement model.

1.1 What is “Measurement”?
Measurement is widely practiced in many domains, such as science,
manufacturing, trade, medicine, health, psychology, education and management. It is the
aim of this book to focus particularly on measurement in domains where the intent is that
human attributes (or properties) are to be measured, attributes such as their achievements,
their attitudes, or their behaviors. Typically, these are measured using instruments such
as psychological scales, achievement tests, questionnaires and behavioral checklists. The
reason for gathering the measurements may range from making a decision about just a
single person, to making decisions about social groups (such as schools, businesses, etc.),
including groups involved in a research study (such as a psychological experiment, or a
design experiment); sometimes there is no explicit decision to be made, but instead the purpose is to monitor and track certain aspects, such as changes over time.

The general approach to measurement adopted here is one that is especially pertinent to the social domains, but it applies to the physical and biological sciences as well. A
general definition is that ...
measurement is an empirical and informational process, designed on purpose,
whose input is an empirical property of an object and that produces information in
the form of values of that property. (Mari et al, 2021, p. 39)
In this definition, the term “property” is used for the real-world human characteristic that
we wish to measure, generally labelled as an attribute, or a latent trait, but also more
specifically (depending on context) as an ability, an attitude, a behavior, etc. Thus
measuring is a designed process, not merely a matter of finding something adventitiously,
and the outcome is a value (sometimes a number, sometimes an output category), a piece
of information about the person. Important qualities that measurements should possess are objectivity and intersubjectivity. Objectivity is the extent to which the information
conveyed by the measurement concerns the property under measurement and nothing
else. Intersubjectivity is enhanced to the extent that that information is interpretable in the
same way by different measurers in different places and times.

Going beyond this very basic definition, measurement is also characterized by an assessment of the quality of the information conveyed by the output (Mari et al., 2021,
pp. 59-61): “every measurement is tainted by imperfectly known errors, so that the
significance which one can give to the measurement must take account of this
uncertainty” (ISO, 1994: Foreword). The evaluation of random and systematic variations
in measurements is traditionally ascertained through summaries of the typical variations
in measurements over a range of circumstances (i.e., random errors) and investigation of
biases in the measurement (i.e., systematic errors), the two together helping to establish
the trustworthiness of the measurement. In the social sciences, these two aspects have
been referred to, broadly, as relating to the reliability of the measurements, and the
validity of the measurements, respectively, and these will be examined in greater detail in
Chapters 7 and 8. Other terms are used in the physical sciences such as precision and
trueness, respectively, combining together to determine accuracy of the measurement
(Mari et al., 2021, pp. 49-50)1.
Footnote 1: A more up-to-date terminology used in the physical sciences involves measurement uncertainty, which includes multiple aspects (Mari et al., 2021, pp. 55-58). But we will use the more traditional social science labels in this book, to avoid confusion (see Chapters 7 and 8).

The approach adopted here is predicated on the idea that there is a single
underlying attribute that an instrument is designed to measure. Many surveys, tests, and
questionnaires are designed to measure multiple attributes—here it will be assumed that,
at least in the first instance, we can consider those characteristics one at a time, so that the
full survey or test is seen as being composed of several instruments each measuring a
single attribute (although the instruments may overlap in terms of the items). This
intention is established by the person who designs and develops the instrument (the
instrument developer) and is then adopted by others who also use the instrument (the
measurers). Correspondingly, the person who is the object of measurement will be
called the respondent throughout this book—as they are most often responding to
something that the measurer has asked them to do—although that will be made more
specific in particular contexts, such as “student” where the context is within education,
“subject” where it is a psychological study, “patient” when it involves medical practice,
etc. Note that, although the central focus of the applications in this book is the
measurement of human attributes, this is not a limitation of the general procedures
described—they can be applied to properties of any complex object—and this will be
commented upon when it is pertinent.

<Insert Textbox 1.1 about here>

The measurements that result from applying the instrument can be seen as the result
of a scientific argument (Kane, 2006) embodied in the instrument and its design and
usage. The decision may be the most basic measurement decision, namely that a respondent
has a certain value on the attribute in question (as in the Basic Evaluation Equation—
Mari et al. 2021, p. 117), or it may be part of a larger context where a practical or a
scientific decision needs to be made. The building blocks of the BAS that are described
below can thus be seen as a series of steps that can be used as the basis for this argument.
First, the argument is constructive; that is, it proceeds by constructing the instrument
following a design logic based on the definition of measurement above (this occupies the
contents of Chapters 2 through 5). Then the argument is reflective, proceeding by
gathering data on the instrument’s functioning in an empirical situation, and interpreting
the resulting information on whether the instrument did indeed function as planned in
terms of validity and reliability (this occupies the contents of Chapters 6 to 8).

Thus, in this book, the concept that is being explored is more like a verb,
“measuring,” than a noun, “measurement.” In general, the approach here can be seen as
being an embodiment of Mislevy’s sociocognitive approach to human measurement
(Mislevy, 2018) and also as an example of Principled Assessment Design (see Chapter 9
and also: Ferrara et al., 2016; Nichols et al, 2016; Wilson & Tan, in press). There is no
claim being made here that the procedures described below are the only way to make
measurements—there are other approaches that one can adopt. The aim is not to survey
all such ways to measure, but to lay out one particular approach that the author has found
successful over the last three and a half decades of teaching measurement to students at
the University of California, Berkeley, and consulting with people who want to develop
instruments in a wide variety of areas.


1.1.1 Construct Modeling
What is the central aim of the BAS?

The general approach to measurements that is described in this book, which I call
construct modeling, is based on a constructive way of understanding the process of
measurement. It is closely related to the approach taken by the (US) National Research
Council (NRC) in a Committee report on the status of educational assessment at the turn
of the century (NRC, 2001). The Committee laid out what has become a broadly
accepted formulation of what should be the foundations of measurement in that field (and
more broadly, in social sciences). According to the Committee (see the “NRC
Assessment Triangle” in Figure 1.1):
First, every assessment is grounded in a conception or theory about how people
learn, what people know, and how knowledge and understanding progress over
time. Second, each assessment embodies certain assumptions about which kinds
of observations, or tasks, are most likely to elicit demonstrations of important
knowledge and skills from students. Third, every assessment is premised on
certain assumptions about how best to interpret the evidence from the
observations in order to make meaningful inferences about what students know
and can do. (p. 16)
In Figure 1.1, the three foundations are labeled as Cognition, Observation, and Interpretation. The foundations, then, are seen as constituting a guide for “the process of
collecting evidence to support the types of inferences one wants to draw ... referred to as
reasoning from evidence [emphasis in original]” (Mislevy 1996, p. 38). Thus, construct
modeling can also be seen as an example of evidence-centered design for measurement
(Mislevy et al., 2003).

Figure 1.1. The National Research Council (NRC) Assessment Triangle

[Diagram: a triangle with vertices labeled Cognition (top), Observation, and Interpretation.]

1.2 The BEAR Assessment System (BAS)
What are the parts of the BAS, and how do they relate to measuring?

The BEAR Assessment System (BAS; Wilson & Sloane 2001) is an application
of construct modeling. It uses four “building blocks” to address the challenges embodied
in the NRC Triangle: (a) construct map, (b) items design, (c) outcome space, and (d)
calibration model. These building blocks are shown in Figure 1.2 in the form of a
cyclical sequence that occurs during assessment development, a cycle which may also
iterate during that development. These four building blocks are each an application of
(parts of) the three foundations from the NRC Triangle. Hence, the foundations are also
seen as being principles for assessment development. The match, in sequence, is
(a) the construct map is the embodiment of the principle of Cognition,
(b) the items design is the practical plan for carrying out Observation, and
(c) outcome space and calibration model jointly enable Interpretation.
This correspondence is explained below, in the respective sections of Chapter 1, using a
single example to help make the points concrete, and then each of the next four chapters
is devoted to one of the building blocks, in turn, giving further examples to elucidate the
range of application.

Figure 1.2 The four building blocks in the BEAR Assessment System (BAS)

[Diagram: a cycle linking the four building blocks: Construct Map, Items Design, Outcome Space, and Calibration Model.]

In this chapter, the 4 Building Blocks will be illustrated with a recent example
from educational assessment—an assessment system built for a middle school statistics
curriculum that leans heavily on the application of learning sciences ideas in the “STEM”
(science/technology/engineering/mathematics) domain, the Data Modeling curriculum
(Lehrer et al., 2014). The Data Modeling project, carried out jointly by researchers at Vanderbilt University and UC Berkeley, was funded by the US National Science
Foundation (NSF) to create a series of curriculum units based on real-world contexts that
would be familiar and interesting to students. The goal was to make data modeling and
statistical reasoning accessible to a larger and more diverse pool of students along with
improving preparation of students who traditionally do not do well in STEM subjects in
middle school. The curriculum and instructional practices utilize a learning progression
to help promote learning. This learning progression describes transitions in reasoning
about data and statistics when middle school students are inducted into practices of

visualizing, measuring, and modeling the variability inherent in contextualized processes.
In the Data Modeling curriculum, teaching and learning are closely coordinated with
assessment.

1.3 The Construct Map


How should the attribute be described?

The most obviously prominent feature of the measurement process is the instrument—the test, the survey, the interview, etc. But, when it comes to developing
the measurement process itself, the development of the instrument is actually somewhat
down that track. Pragmatically, the first inkling is typically embodied in the purpose for
which an instrument is needed and the context in which it is going to be used (i.e.,
involving some sort of decision). This need to make a decision is often what precipitates
the idea that there is an attribute of a person that needs to be measured. Thus, even
though it might not be the first step in measurement development, the definition of the
attribute must eventually take center place.

Consistent with current usage, the attribute to be measured will be called the
construct (see Messick (1989) for an exhaustive analysis). A construct could be a part of
a theoretical model of a person’s cognition, such as their understanding of a certain set of
concepts, or their attitude toward something, or it could be some other psychological
variable such as “Need for Achievement” or a personality variable such as Extraversion. It could be from the domain of educational achievement, or it could be a
health-related construct such as “Quality of Life,” or a sociological construct such as
“rurality” or migrants’ degree of assimilation. It could relate to a group rather than an
individual person, such as a work group or a sports team, or an institution such as a
workplace. It can also take as its object something that is not human, or composed of
humans, such as a forest’s ability to spread in a new environment, a volcano’s proclivity
to erupt, or the weathering of paint samples. There is a multitude of potential constructs
—the important thing here is to have one that provides motivation for developing an
instrument, a context in which the instrument might be used, and, ideally, a theoretical
structure for the construct.

The idea of a construct map is a more precise concept than “construct.” First, we
assume that the construct we wish to measure has a particularly simple form—it extends
from one end of the construct to another, from high to low, or small to large, or positive
to negative, or strong to weak—one might say that it is unidimensional, though that has
technical meanings that extend beyond the conceptual scope here (see Figure 1.3). The
second assumption is that there are consecutive distinguishable qualitative points between
the extremes. Quite often the construct will be conceptualized as describing successive
points in a process of change, and the construct map can then be thought of as being
analogous to a qualitative “roadmap” of change along the construct (see e.g., Black et al.,
2011). In recognition of this analogy, these qualitatively-different locations along the
construct will be called “waypoints”—and these will, in what follows, be very important
and useful in interpretation. Each waypoint has a qualitative description in its own right,
but, in addition, it derives meaningfulness by reference to the waypoints below it and
above it. Third, we assume that the respondents can (in theory) be at any location in between those waypoints—that is, the underlying construct is dense in a conceptual sense.
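To make these three assumptions concrete, here is a minimal sketch in Python (the names, such as ConstructMap and rank_of, are illustrative assumptions, not part of BASS) that represents a construct map as an ordered list of qualitatively described waypoints along a single dimension, with no metric spacing assumed:

from dataclasses import dataclass

@dataclass(frozen=True)
class Waypoint:
    # A qualitatively distinguished point along the construct.
    label: str
    description: str

@dataclass
class ConstructMap:
    # An ordered set of waypoints on a single, conceptually dense dimension.
    # The ordering is substantive (theory-based); no metric spacing is assumed.
    name: str
    waypoints: list  # ordered from lowest to highest

    def rank_of(self, label: str) -> int:
        # Return the substantive order (1 = lowest) of a waypoint label.
        for rank, wp in enumerate(self.waypoints, start=1):
            if wp.label == label:
                return rank
        raise ValueError(f"unknown waypoint: {label}")

# The generic map of Figure 1.3: four ordered waypoints, lowest to highest.
generic_map = ConstructMap(
    name="Generic construct",
    waypoints=[
        Waypoint("1", "Lowest qualitatively distinguished point"),
        Waypoint("2", "Intermediate point"),
        Waypoint("3", "Intermediate point"),
        Waypoint("4", "Highest qualitatively distinguished point"),
    ],
)
print(generic_map.rank_of("3"))  # prints 3

In such a sketch a respondent's location would be a point anywhere on the underlying continuum; the waypoints only supply the qualitative interpretation of regions of that continuum.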

There have been historically preceding concepts that have been formative in
developing the idea of a construct map. They each feature some aspects of the idea
described in the preceding paragraph, but none are quite the same. Probably the most
prominent example is Bloom’s Taxonomy (Bloom, 1994), which focusses on behavioral
objectives (in education) as central planning tools for educational curricula, and which
features hierarchies of levels of objectives for the cognitive (Bloom et al., 1956) and affective (Krathwohl et al., 1964) domains—for example, the levels of the cognitive
domain have the following labels: Remember, Understand, Apply, Analyze, Evaluate,
and Create (from the latest revision, Anderson et al. (2001)). Historically, this has been
broadly used in educational circles around the world, and its status as a predecessor must
be acknowledged. However, there are important distinctions between Bloom’s
Taxonomy and the concept of a construct map:
(a) Bloom’s Taxonomy is a list of objectives, designed to help plan a sequence of
instruction—there is no theoretical necessity to see them as defining a single
construct,
(b) the levels of the Taxonomy, being behavioral objectives, are not necessarily good
targets for designing assessments (although many educational researchers have
used them in that way),
(c) the Taxonomy is seen as being universal, ranging across almost any cognitive or
affective variable, which contrasts with the required specificity of the construct in
the construct map, and
(d) there is no posited relationship between the underlying construct and the Taxonomy’s
equivalent of the waypoints (i.e., “Bloom’s levels”).
Other important precedents include the stage theories of Jean Piaget (regarding cognitive
development; Flavell (1963), Inhelder & Piaget (1958)), and the learning hierarchies of
Robert Gagne (regarding conceptual learning; Gagne (1968)). Each of these has similarities to, and distinctions from, the concept of a construct map (for discussion of each,
see Wilson (1989)).

In Figure 1.3, there are four illustrative waypoints—these will be defined within
the theoretical context (i.e., semantically), and are ordered within that same theory—and
the line running between them represents the (dense) possible locations where a
respondent might lie. The waypoints, although substantively ordered, as noted, are not necessarily equally spaced on the metric of the construct, and this, too, is exemplified in Figure 1.3. As yet, there is no metric that has been developed—no scale, so to speak—so no expectations about spacing can usefully be entertained at this point. To
reiterate, at this initial stage of instrument development, the construct map is still an idea,
a latent rather than a manifest conception. For those familiar with psychometric models in
the social sciences, one can think of a construct map as a special sort of unidimensional
latent variable (i.e., the underlying latent variable for a monotone unidimensional latent
variable model2 (Junker & Ellis, 1997)). It is special in the sense that the underlying

Footnote 2: These include unidimensional item response models (Rasch models, 2PL and 3PL IRT models) as well as unidimensional factor analysis models. Note that no items have as yet been introduced, so concepts like monotonicity and independence are not yet relevant to the conceptualization at this point.

latent variable is not just a uniform scale, but also has special waypoints in it (which are
derived from the substantive content of the construct).

Figure 1.3 Illustration of a generic construct map, incorporating qualitative person-side waypoints.

[Diagram: a vertical line representing the construct, with four unevenly spaced waypoints marked on it, from top to bottom: Highest qualitatively distinguished point (4), Intermediate point (3), Intermediate point (2), Lowest qualitatively distinguished point (1).]

Constructs are often more complex than this: First, they may be
multidimensional3. This is not a barrier to the use of the methods described in this book
—the most straightforward thing to do is to tackle each of the multiple dimensions one at
a time—this allows each to be seen as being potentially representable by a construct map.
The structure of some constructs also precludes the possibility of being described well by
a construct map. For example, suppose the construct consists of two different latent
groups, say, those who are likely to immigrate, and those who are not. Given that this
construct does not lie on a continuum like a typical construct map, it is not likely to be
well-represented by one. If, however, these groups were defined by their locations on an
underlying single dimension, then there may be value in thinking of this as a very
diminutive version of a construct map.

Second, there may be some complexity in what happens in between the extremes
in a single dimension. For example, there may be different ways that a particular point-
of-interest is expressed under different circumstances, and these differences may relate to
the underlying scientific theory, or to the way that observations are designed in the
context of a particular instrument. An example like this is illustrated in Section 10.3.

1.3.1 Example 1: The MoV Construct in the Data Modeling Assessments

Footnote 3: In a multidimensional case, where the underlying constructs are not orthogonal, it can be advantageous to recruit information from other dimensions to support person measurement in each dimension. In that case, it would make sense to estimate multidimensional models rather than the unidimensional models used in this book. See Section 8.4 to get help in investigating this possibility, and also Schwartz et al. (2017) for an example of how the approach adopted here can also work in such cases.

Both the Data Modeling curriculum (introduced at the end of Section 1.2) and its assessment system are built on a set of six constructs or strands of the learning progression (the complete set of all six is described in Section 9.1). These have been
designed to describe how students typically develop as they experience the Data
Modeling curriculum. Conceptual change along these constructs was encouraged by
instructional practices that served to engage students in versions of central professional
practices of statisticians and data scientists adapted to a school context and appropriate
for young students. As is typical among professionals, students developed and iterated
how to visualize, measure, and model variability. In the account here, the primary focus
is on the assessment aspects of the learning progression, mainly concentrating on the
development and deployment of the assessments. Further information on the instructional
aspects of the learning progression can be found in Lehrer et al. (2020).

A construct map consists of an ordered list of typical waypoints that students reach as they progress through a series of ways of knowing. The book’s initial example of a construct map is the “Models of Variability” (MoV) construct map,4 shown in Figure 1.4; it represents how students in the Data Modeling curriculum typically learn how to devise and revise models of variability. As student
conceptions of chance develop, the MoV construct describes a progression in creating and
evaluating mathematical models that feature random variation. These models of chance
initially focus on the identification of sources of variability, then advance to the
incorporation of chance devices to represent the (mathematical) mechanism of those
sources of variability, and, at the highest waypoint, involve the judgment of how well the
model works (i.e., model fit) by examining how repeated model simulations relate to an
empirical sample. In the Data Modeling curriculum, a student’s ideas about models and modeling are scaffolded by classroom practices that encourage students (usually working together in small groups) to develop and critique models of data generation
processes.

Footnote 4: Details about the MoV example have been previously published in Wilson & Lehrer (2021).

Figure 1.4 The MoV construct map. (Note that the distances between waypoints on this
map are arbitrary.)

MoV5: Account for variability among different runs of model simulations to judge
adequacy of model.

MoV4: Develop emergent models of variability.

MoV3: Use a chance device to represent a source of variability or the total variability
of the system.

MoV2: Informally describe the contribution of one or more sources of variability to variability observed in the system.

MoV1: Identify sources of variability.

At the first point-of-interest, shown at the bottom of Figure 1.4 and labelled
“MoV1,” students associate variability with particular sources of variability, which the
curriculum encourages by asking students to reflect on random processes characterized
by signal and noise. For example, when considering variability of measures of the same
object’s width, students may consider variability as arising from errors of measurement,
and “fumbles” made by some measurers because they were not sufficiently precise. To
be judged as being at this initial point, students should say something about one or more
sources of variability but not go so far as to implicate chance origins to variability. Note
that, in actually recording such observations, it is also common that some students will
make statements that do not reach this initial point on MoV, and hence these might be
labelled as being at MoV0 (by implication, below MoV1), but this story is a little more
complicated, as will be noted in Section 1.2.3.

When students get to Waypoint MoV2, they informally begin to order the relative
contributions of different sources to variability, using wording such as “a lot” or “a little.”
Thus, students are referring (implicitly though not explicitly) to mechanisms that they
believe can cause these distinctions, and they typically predict or explain the effects on
variability. Consider the following example of a conversation between two students each
of whom has measured the perimeter of a table. After noting the difference between their

results, they discuss the possible reasons for the difference. One reason they noted is a
“false start” ...

Cameron: How would we graph--, I mean, what is a false start, anyway?


Brianna: Like you have the ruler, but you start at the ruler edge, but the ruler
might be a little bit after it, so you get, like, half a centimeter off.
Cameron: So, then it would not be 33, it’d be 16.5, because it’d be half a
centimeter off?
Brianna: Yeah, it might be a whole one, because on the ruler that we had, there
was half a centimeter on one side, and half a centimeter on the other side, so it
might be 33 still, and I think we subtract 33.
Cameron: Yeah, because if you get a false start, you’re gonna miss. (Lehrer,
Schauble & Wisittanawat, 2020)

This conversation exemplifies students’ attempts to delineate the nature of each source of variability and to debate how important the source is in a model.

The move up to MoV3 is an important transition in student reasoning. At this point, students begin to explicitly think about chance as contributing to variability. In the
Data Modeling curriculum, students initially experience and investigate simple devices
and phenomena where it is agreed (by all in the classroom) that the behavior of the device
is “random.” An example of one such device is the type of spinner illustrated in Figure
1.5. The students have initial experiences with several different kinds of spinners
(different numbers of sectors, different shapes, etc.). Then they are engaged in a design
exercise where students are provided with a blank (“mystery”) spinner and are asked to draw
a line dividing the spinner into two sectors which would produce different proportions of
the outcomes of the spinner, say 70% and 30%. This is then repeated for different
proportions and different numbers of categories, so that students understand how the
geometry of the spinners embodies both the randomness and the structural features of chance. The conceptual consequences of such investigations are primarily captured in
another construct (i.e., the Chance construct, see Section 9.1), but in terms of the
modeling, students also are led to understand that chance can be a source of variability
even in modeling.
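The behavior of such a chance device is easy to simulate; the sketch below (hypothetical code, not part of the Data Modeling materials or BASS) simulates a two-sector “mystery” spinner whose sectors cover 70% and 30% of the area, and shows that the proportions of outcomes over many spins approach the sector sizes:

import random

def spin(sectors):
    # Spin once: sectors maps each outcome label to its proportion of the spinner's area.
    r = random.random()
    cumulative = 0.0
    for label, proportion in sectors.items():
        cumulative += proportion
        if r < cumulative:
            return label
    return label  # guard against floating-point rounding at the boundary

# A 70% / 30% spinner, like the "mystery" spinner design exercise.
spinner = {"A": 0.70, "B": 0.30}
n = 10_000
outcomes = [spin(spinner) for _ in range(n)]
for label in spinner:
    print(label, outcomes.count(label) / n)  # close to 0.70 and 0.30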

Figure 1.5 Illustration of a spinner.

At MoV4, the challenge to students is to transition from thinking about single (random) sources of variability (such as the spinner in Figure 1.5) to conceptualizing
variability in a process as emerging from the combination of multiple sources of
variation, some of which might be random and some not. For example, a distribution of
repeated measurements of an attribute of an object might be thought of as a combination of a fixed amount of that attribute (which students might think of as the “true amount”) and one or more components of chance error in the measurement process.
Within the Data Modeling curriculum, one such exercise involves teams of students measuring the
“wingspan” of their teacher (i.e., the width of the teacher’s reach when their arms are
spread out). Here the random sources might be
(a) the “gaps” that occur when students move their rulers across the teacher’s back,
(b) the “overlaps” that occur when the endpoint of one iteration of the ruler overlaps with
the starting point of the next iteration, and
(c) the “droop” that occurs when the teacher becomes tired and their outstretched arms droop.
The students not only consider these effects, but, in this curriculum, they go on to model
them using assemblies of spinners, initially in their usual physical format, and advancing
to a virtual representation on a computer (which makes simulations easier, of course!).
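A minimal sketch of this MoV4 idea of emergent variability follows (hypothetical code; the error components echo the “gaps,” “overlaps,” and “droop” above, but their signs and magnitudes are illustrative assumptions, not values from the curriculum). Each simulated measurement combines a fixed true wingspan with several chance error components:

import random
import statistics

def simulated_wingspan_measurement(true_cm=160.0):
    # One simulated measurement: a fixed true value plus chance error components.
    gap_error = -random.uniform(0.0, 2.0)      # gaps: some distance never counted
    overlap_error = random.uniform(0.0, 2.0)   # overlaps: some distance counted twice
    droop_error = -random.uniform(0.0, 1.0)    # droop: the span itself shrinks a little
    return true_cm + gap_error + overlap_error + droop_error

measurements = [simulated_wingspan_measurement() for _ in range(30)]
print(round(statistics.median(measurements), 1),
      round(statistics.stdev(measurements), 2))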

When they move on to MoV5, students arrive at an evaluative point, where they
consider variability when judging the success of the models they have devised
previously. They run multiple simulations and observe that one run of a model’s
simulated outcomes may fit an empirical sample well (e.g., similar median and interquartile range values, similar “shapes,” etc.) but the next simulated sample might not. In
this way, students are prompted by the teacher to imagine running simulations repeatedly,

and thus can come to appreciate the role of multiple runs of model simulations as a tool
that can be used to ascertain the success of their model. This is a very rich and
challenging set of concepts for students to explore, and, eventually, grasp. In fact, even
when assessing the abilities of students in this domain at the college entry level, we have
found that the accurate understanding of sampling statistics is indeed rather poorly
mastered (see for example, Arneson et al., 2018).
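In that spirit, the sketch below (hypothetical code and made-up data) runs a simple chance model many times and records how the median and interquartile range of each simulated sample vary around those of an empirical sample, which is the comparison students are prompted to imagine when judging model fit:

import random
import statistics

def iqr(values):
    # Interquartile range via the lower and upper quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def simulate_sample(n, true_cm=160.0, error_sd=2.0):
    # One run of a simple chance model: true value plus normally distributed error.
    return [random.gauss(true_cm, error_sd) for _ in range(n)]

# A small, made-up empirical sample of wingspan measurements, in cm.
empirical = [158.5, 159.0, 160.0, 160.5, 161.0, 161.5, 162.0, 157.5, 163.0, 160.0]

runs = [simulate_sample(len(empirical)) for _ in range(200)]
sim_medians = [statistics.median(run) for run in runs]
sim_iqrs = [iqr(run) for run in runs]

print("empirical median / IQR:", statistics.median(empirical), iqr(empirical))
print("simulated medians range:", min(sim_medians), "to", max(sim_medians))
print("simulated IQRs range:", min(sim_iqrs), "to", max(sim_iqrs))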

The Data Modeling MoV construct map is an example of a relatively complete construct map. When a construct map is first postulated, it is often much more
nebulous. The construct map is refined through several processes as the instrument is
being developed. These processes include: (a) explaining the construct to others with the
help of the construct map, (b) creating items that you believe will lead respondents to
give responses that inform the waypoints of the construct map, (c) trying out those items
with a sample of respondents, and (d) analyzing the resulting data to check if the results
are consistent with your intentions, as expressed through the construct map. These steps
are illustrated in the three building blocks discussed in the next three sections of this
chapter.

1.4 The Items Design


What are the critical characteristics of the items?

The next step in instrument development involves thinking of ways in which the
theoretical construct embodied in the construct map could be manifested via a real-world
situation. At first, this may be not much more than a hunch: a context where the
construct is involved or plays a determining role. Later, this hunch will become more
crystallized, and settle into certain patterns. The time-ordered relationship between the
items and the construct is not necessarily one-way as it has just been described in the
previous section. Oftentimes, the items will be thought of first, and the construct will be
elucidated only later—this is simply an example of how complex a creative act such as
instrument construction can be. The important thing is that the construct and the items
should be distinguished, and that eventually, the items are seen as prompting realizations
of the construct within the respondent.

For example, the Data Modeling items often began as everyday classroom
experiences and events that teachers have found to have a special significance in the learning
of variability concepts. Typically, there will be more than one real-world manifestation
used in the instrument; these parts of the instrument are generically called “items,” and
the format in which they are presented to the respondent will be called the items design,
which can take many forms. The most common ones are the multiple-choice format used
in achievement testing and the Likert-type format from surveys and attitude scales (e.g.,
with responses ranging from “strongly agree” to “strongly disagree”). Both are examples
of the “selected response” item type, where the respondent is given only a limited range
of possible responses and is forced to choose amongst them. There are many variants of
this, ranging from questions on questionnaires, to consumer rankings of products. In
contrast, in other types of items the respondent may also produce a “constructed
response” within a certain mode, such as an essay, an interview, a performance (such as a
competitive dive, a piano recital, or a scientific experiment). In all of these examples so
far, the respondent is aware that they are being observed, but there are also situations

where the respondent is unaware of the observation. A person might be involved in a
game, for example, where an observer (human or automated) might record a certain suite
of behaviors without the gamer being aware of the situation. Of course, in addition, the
items may be varied in their content and mode: Interview questions will typically range
over many aspects of a topic; questions in a cognitive performance task may be presented
depending on the responses to earlier items; items in a survey may use different sets of
options—and some may be selected response and some constructed response.

1.4.1. Example 1: MoV Items

In the case of the Data Modeling assessments, the items are deployed in a number of ways: (a) as part of a summative pretest and posttest, (b) as part of meso-level assessment following units of instruction (Wilson, 2021), and (c) as part of micro-level assessment in the form of prompts for “assessment conversations” where a teacher discusses the item and student-suggested responses with groups of students.

To illustrate the way that items are developed to link to the construct map,
consider the Piano Width task, shown in Figure 1.6. This task capitalizes on the Data
Modeling student’s experiences with ruler iteration errors (i.e., the “gaps and laps” noted
in the previous section) in learning about wingspan measurement as a process that
generates variability. Here the focus is on question 1 of the Piano Width item: The first
part, 1(a), is intended mainly to prompt the student to take one of two positions—there are, of course, several explicit differences in the results shown in the two displays, but question 1(a) focusses the students’ attention on whether the results show an effect due to
the students’ measurement technique (i.e., short ruler versus long ruler). Some students
may note that the mode is approximately the same for both displays, and consider that
sufficient to say “No.” Thus, this question, although primarily designed to set up the part
1(b) question, is also addressing a very low range on the MoV construct—these students
are not able to specify the source of the variation, as they are not perceiving the spread as
being relevant. Following this set-up in question 1(a), the second part, 1(b) does most of
the heavy lifting for MoV, exploring the students’ understanding of how measurement
technique could indeed affect variation, and targeting MoV2. This question does not
range up beyond that, as no models of chance or chance devices are involved in the
question.

Figure 1.6 The Piano Width task

A group of musicians measured the width of a piano in centimeters. Each musician in this
group measured using a small ruler (15 cm long). They had to flip the ruler over and over
across the width of the piano to find the total number of centimeters. A second group of
musicians also measured the piano’s width using a meter stick instead. They simply laid
the stick on the piano and read the width. The graphs below display the groups’
measurements.

[Two histograms of the groups’ measurements, showing counts of measurements in bins 70-74, 75-79, 80-84, 85-89, and 90-94 cm. Left panel: Piano width (cm) using the small ruler. Right panel: Piano width (cm) using the meter stick.]

1(a) The two groups used different tools.


Did the tool they used affect their measurement? (check one)

( ) Yes ( ) No

1(b) Explain your answer.


You can write on the displays if that will help you to explain better.

2(a) How does using a different tool change the precision of measurements? (check one)

(a) Using different tools does not affect the precision of measurements.
(b) Using the small ruler makes precision better.
(c) Using the meter stick makes precision better.

2(b) Explain your answer.


(What about the displays makes you think so?)

1.4.2 The Relationship Between the Construct and the Responses

The initial situation between the first two building blocks can be depicted as in
Figure 1.7. Here the construct and the items are both only vaguely known, and there is
some intuitive relationship between them (as indicated by the curved dotted line).
Causality is often unclear at this point: perhaps the construct "causes" the responses that are made to the items, or perhaps the items existed first in the measurement developer’s plans and hence could be said to "cause" the construct to be developed by the measurement developer. It is important to see this as a natural step in instrument development—a step that often occurs at the beginning of instrument development, and that may recur many times as the instrument is tested and revised.

Figure 1.7 A picture of an initial idea of the relationship between construct and item responses.

[Diagram: “Construct” and “Responses to items,” joined by a curved dotted line.]

Unfortunately, in some instrument development efforts, the conceptual approach does not go beyond the state depicted in Figure 1.7, even when there are sophisticated
statistical methods used in the data analysis (which, in many cases, do indeed assume a causal order). This unfortunate abbreviation of the instrument development process, mainly associated with an operationalist view of measurement (Mari et al., 2021), will
typically result in several shortcomings:
(a) arbitrariness in choice of items and item formats,
(b) no clear way to relate empirical results to instrument improvement, and
(c) an inability to use empirical findings to improve the conceptualization of the
construct.
To avoid these issues, the measurer needs to build a structure that links the construct
closely to the items—one that brings the inferences as close as possible to the
observations.

One way to do that is to see causality as going from the construct to the items—
the measurer assumes that the respondent “has” some amount of the construct, and that
amount of the construct is conceived of as a cause of the responses to the items in the
instrument that the measurer observes. That is the situation shown in Figure 1.8—the
causal arrow points from left to right. However, this causal agent is latent—the measurer
cannot observe the construct directly. Instead, the measurer observes the responses to the

items, and must then infer the underlying construct from those observations. That is, in
Figure 1.8, the direction of the inference made by the measurer is from right to left. It is
this two-way relationship between the construct and the responses that is responsible for
much of the confusion and misunderstanding about measurement, especially in the social
sciences—ideas about causality get confounded with ideas about inference, and this
makes for much confused thinking (see Mari et al, 2021).

Figure 1.8 A picture of the Construct Modeling idea of the relationship between degree of construct possessed and item responses.

[Diagram: “Construct” and “Responses to items,” with a Causality arrow pointing from the construct to the responses, and an Inference arrow pointing from the responses back to the construct.]

The remaining two building blocks embody two different steps in that inference.
Note that the idea of causality here is an assumption, and the analysis does not prove that
causality is in the direction shown; it merely assumes it goes that way. In fact, the actual
mechanism, like the construct, is unobserved or latent. It may be a much more complex
relationship than the simple one shown in Figure 1.8. Until more extensive research
reveals the nature of that complex relationship, the measurer will be forced to act as
though the relationship is the simple one depicted.

1.5 The Outcome Space


How can responses be categorized so the categories are maximally useful?

The first step in the inference process illustrated in Figure 1.8 is to decide which
aspects of the response will be used as the basis for the inference, and how those aspects
will be categorized and scored. The result of all these decisions will be called the
Outcome Space in this book. Examples of familiar outcome spaces include:
(a) the categorization of question responses into “true” and “false” on a test (with
subsequent scoring as, say, “1” and “0”), and
(b) the recording of Likert-style responses (Strongly Agree to Strongly Disagree) on an
attitude survey, and their subsequent scoring depending on the valence of the
items compared to the underlying construct.
Less common outcome spaces would be:
(c) the question and prompt protocols in a standardized open-ended interview (Patton,
1980, 202-205) and the subsequent categorization of the responses, and
(d) the translation of a performance into ordered categories using a scoring guide
(sometimes called a “rubric”).

Sometimes the categories themselves are the final product of the outcome space, and
sometimes the categories are scored so that the scores can (a) serve as convenient labels
for the outcomes categories, and (b) be manipulated in various ways. To emphasize this
distinction, the second type of outcome space may be called a "scored" outcome space.
The resulting scores play an important role in the construct mapping approach. They are
the embodiment of the “direction” of the construct map (e.g., positive scores go
“upwards” in the construct map).
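For instance, a scored outcome space for Likert-style responses could be set up as in this sketch (hypothetical code): each response category maps to a number, and the mapping is reversed for negatively worded items so that higher scores always run upwards on the construct map.

# Categories of a Likert-style outcome space, ordered from low to high.
LIKERT = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

def score_likert(response, positive_valence=True):
    # Map a response category to a score of 0..4.
    # For an item worded against the construct (negative valence), the scoring
    # is reversed so that higher scores still mean "more" of the construct.
    raw = LIKERT.index(response)  # 0 (Strongly Disagree) .. 4 (Strongly Agree)
    return raw if positive_valence else (len(LIKERT) - 1 - raw)

print(score_likert("Agree"))                          # 3
print(score_likert("Agree", positive_valence=False))  # 1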

The distinction of the outcome space from the items design, as described in the
previous section, is not something that people are commonly aware of, and this is mainly
due to the special status of what are probably the two most common item formats—the
Likert-style item common in attitude scales and questionnaires, and the multiple choice
item common in achievement testing. In both item formats, the items design and the
outcome space have been collapsed—there is no need for the measurer to categorize the
responses as that is done by the respondents themselves. And in most cases, the scores to
be applied to these categories are also fixed beforehand.

However, these common formats should be seen as “special cases”—the more generic situation is where the respondent constructs their own responses, most commonly
in a written (e.g., an essay) or verbal (e.g., a speech or an interview) format, but could
also be in the form of a performance (e.g., a dive) or a produced object (e.g., a statue). In
this constructed response type of outcome space, the responses are selected into certain
categories by a rater (sometimes called a “reader” or a “judge”). The rater might also be
a piece of software that is part of an automated scoring system, as featured in the latest
educational technology using machine-learning algorithms. That the constructed
response form is more basic becomes clear when one sees that the development of the
options for the selected responses will in most cases include an initial development iteration that uses the free-response format (we return to this point in Section 3.3).5

In developing an outcome space for a construct map, several complications can arise. One is that, when the designed waypoints are confronted with the complications of actual responses to items, sometimes there is found to be a useful level of sub-categories
within (at least some) waypoints. These can be useful in several ways: (i) they give more
detail about typical responses, and hence help the raters make category decisions, (ii)
they can be related to follow-on actions that might result from the measurements (e.g., in
the case of achievement tests, point teachers towards specific instructional strategies),
and, in some situations (iii) they may give hints that there may be a finer grain of
waypoints that could form the basis for measurement if there were more items/response
etc. In case (iii) this may be denoted in the category labels using letters of the alphabet
(“MoV2A, etc.), though other denotations may be more suitable when implications of
ordering is not warranted. (This is exemplified in the example in the next subsection.) A
variation on this is that sometimes there are categorizations that are considered
“somewhat higher” than a certain waypoint, or “somewhat below” in which case “+” and
“-” are simply added to the label for the waypoint: Mov2+ or MoV2-, for instance (this is
exemplified in Section 2.2).
Footnote 5: That the Likert-style response format does not, in many cases, require such an initial step may be seen as an advantage for the developers, but see the discussion of Example 2 in Section 4.3.3.

1.5.1 Example 1: The MoV Outcome Space

The outcome space for the Data Modeling Models of Variability construct is
represented in Appendix 1A—it is not shown here due to its length. Glancing at the
Appendix, one can see that the Data Modeling outcome space is conceptualized as being
divided into 5 areas corresponding to the 5 waypoints in the construct map shown in
Figure 1.4, running from the most sophisticated at the top to the least sophisticated at the bottom. The columns of the outcome space represent the following (reading from left to right):
(a) the label for the point-of-interest (e.g., “MoV5”),
(b) a description of each followed by
(c) a label for possible intermediate points and
(d) a description of each of those and finally,
(e) examples of student responses to items for each of the intermediate points.
This representation is illustrated for one of the waypoints of MoV, MoV2, in Figure 1.9.
Here we can note that MoV2 has been divided into two intermediate points, MoV2A and
MoV2B. These distinguish between (a) an apprehension of the relative contributions of
the sources of variability (MoV2A) and (b) demonstration of an informal understanding
of the process that affects that variability (i.e., MoV2B), respectively. These are seen as
being ordered in their sophistication, though the relative ordering will also depend on the
contexts in which these are displayed (i.e., we would expect this ordering within a certain
context, but across two different contexts MoV2A may be harder to achieve than
MoV2B).

Figure 1.9 A segment of the MoV Outcome Space (from Appendix 1A).

The outcome space for a construct is intended to be generic for all items in the full
item set associated with that construct. When it comes to a specific item, there is usually
too much item-specific detail to include it all in documents like those in Appendix 1A.
Hence, to make clear how the expected responses to a specific item relate to the outcome
space and to aid raters in judging the responses, a second type of document is needed, the
Scoring Guide, which is focused on a specific item, or, when items are of a generic type,
on an item representing the set.

For example, the scoring guide for the Piano Width item (which was shown in
Figure 1.6 above) is shown in Table 1.1 below. As noted above, the most sophisticated
responses we usually get to the Piano Width item are at MoV2, and typically fall into one
of two MoV2 categories after “Yes” is selected for question 1(a).

MOV2B: the student describes how a process or change in the process affects the
variability, that is, the student compares the variability shown by the two displays.
The student mentions specific data points or characteristics of the displays. For
example, one student wrote: “The Meter stick gives a more precise measurement
because more students measured 80-84 with the meter stick than with the ruler.”
MOV2A: the student informally estimates the magnitude of variation due to one or more
sources, that is the student mentions sources of variability in the ruler or meter
stick. For example, one student wrote: “The small ruler gives you more
opportunities to mess up.”

Note that this is an illustration of how the construct map waypoints may be manifested as multiple intermediate waypoints, and, as in this case, there may be some ordering among them (i.e., MoV2B is seen as a more complete answer than MoV2A).
Less sophisticated responses are also found, such as:

MOV1: the student attributes variability to specific sources or causes, that is the student
chooses “Yes” and attributes the differences in variability to the measuring tools
without referring to information from the displays. For example, one student
wrote: “The meterstick works better because it is longer.”

Of course, students also give unclear or irrelevant responses, and the category for this might be labelled as MoV0. But the Data Modeling developers went a step further: responses such as “Yes, because pianos are heavy” were labelled “No Link(i),” abbreviated “NL(i).” However, in the initial stages of instruction in this topic, students also gave responses that were not clearly yet at MoV1 but were judged to be better than completely irrelevant (i.e., somewhat better than NL(i)). Typically these responses contained relevant terms and ideas but were not accurate enough to warrant labelling as MoV1. For example, one student wrote: “No, Because it equals the same.” This type of response was labelled “No Link(ii),” abbreviated NL(ii), and was placed lower than MoV1 but above NL(i) on the construct map (see Table 1.1).
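Taken together, these categories order the expected responses to the Piano Width item from least to most sophisticated; a minimal sketch of that ordering as a scored outcome space is given below (hypothetical code with illustrative rank scores, not the actual scoring guide format):

# Ordered categories for the Piano Width item, lowest to highest, with
# illustrative rank scores; NL(i) and NL(ii) are the "No Link" categories and
# MoV2A/MoV2B are intermediate points within waypoint MoV2.
PIANO_WIDTH_GUIDE = [
    ("NL(i)",  "Unclear or irrelevant response"),
    ("NL(ii)", "Relevant terms or ideas, but not yet accurate enough for MoV1"),
    ("MoV1",   "Attributes variability to specific sources, without using the displays"),
    ("MoV2A",  "Informally estimates the magnitude of variation due to one or more sources"),
    ("MoV2B",  "Describes how the process affects variability, citing the displays"),
]

def score(category_label):
    # Return the rank (0 = lowest) of a rater's category decision.
    for rank, (label, _) in enumerate(PIANO_WIDTH_GUIDE):
        if label == category_label:
            return rank
    raise ValueError("unknown category: " + category_label)

print(score("MoV2B"))  # 4: the most sophisticated response typically seen for this item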

Table 1.1 Scoring guide for the “Piano Width” item.

1.6 The Calibration Model
How can measurement data be analyzed to help evaluate the construct map?

Once the initial versions of the outcome space and the individual item scoring
guides have been established, the next step is to study the instrument’s empirical behavior by
administering it to an appropriate sample of respondents. This results in a data set
composed of the codes or scores for each person in the sample. The second step in the
inference then, is to relate these scores back to the construct. This is done through the
fourth building block, which we will term the calibration model—sometimes it is also
called a “psychometric model,” or a “statistical model.” Since the conceptualization used
in this chapter thus far does not require that a statistical model be used, it may also
be termed an “interpretational” model (see the NRC Triangle in Figure 1.1). The
calibration model helps one understand and evaluate the scores that come from the item
responses. This informs the measurer about the validity of the construct, and also helps
guide the use of the results in practical applications. Simply put, the calibration model
must translate scored responses to locations on the construct map. Some examples of
calibration models are the “true-score” model of classical test theory, the “domain score”
model, factor analysis models, item response models (including Rasch-family models)
and latent class models. These are all formal models. Many users of instruments (and
some instrument developers) also use informal measurement models when they think
about their instruments.

In this book we will use the affordances of a particular type of calibration model,
the Rasch model (sometimes called a one-parameter6 logistic (1PL) model), and related
models such as the partial credit model (PCM), to help us carry out the calibration step.
This model is suitable for situations where the construct (a) is reasonably well thought of
as a single construct map (as noted above), and (b) has categorical observations. Other
situations are also common, and these will be discussed in later chapters. In this chapter,
we will not examine the statistical equations used in the calibration building block (these
are central to Chapter 5) but will instead focus on the main products of the calibration.
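To give a flavor of what is deferred to Chapter 5, the core of the Rasch model for a dichotomous item can be sketched in a few lines (a minimal illustration, not the BASS implementation): the probability of a correct or positive response depends only on the difference between the respondent’s location and the item’s difficulty, both expressed in logits.

import math

def rasch_p_correct(theta, delta):
    # Rasch (1PL) model: probability of a correct/positive response.
    # theta: respondent location on the construct (logits)
    # delta: item difficulty (logits)
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

print(rasch_p_correct(theta=0.0, delta=0.0))   # 0.5: respondent located at the item's difficulty
print(rasch_p_correct(theta=1.0, delta=0.0))   # about 0.73
print(rasch_p_correct(theta=-1.0, delta=0.0))  # about 0.27

A respondent located exactly at an item’s difficulty has a 50% chance of success, which is the same logic used below to read item thresholds off the Wright map.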

The interpretation of the results from the calibration is aided by several types of
graphical summaries. The graphical summaries we will use here have been primarily
generated using a particular computer application, BASS (See Appendix B). Other
software can be used for the calibration step, and several of them also generate many of
the same graphs (e.g., ConQuest (Adams et al, 2020) and TAM (Robitzsh et al., 2017)).
The most important of these graphical summaries for our purposes in this chapter is the
“Wright Map.” This graph capitalizes on the most important feature of a successful
analysis using the Rasch model: the estimated locations of the respondents on the
construct underlying the construct map can be matched to the estimated locations of the
categories of item responses. This allows us to relate our hypotheses about the items that
have been designed to link to specific construct map waypoints through the response
categories. This feature is crucial for both the measurement theory and measurement
practice in a given context: (a) in terms of theory, it provides a way to empirically
examine the structure inherent in the construct map, and adds this as a powerful element
Footnote 6: The “one parameter” referred to here is the single item parameter, for the item difficulty. There is also one parameter for the respondent’s ability. These will both be explained in Section 5.2 and following sections.

in studying the validity of the use of an instrument; and (b) in terms of practice, it allows the
measurer to “go beyond the numbers” in reporting measurement results to practitioners and
consumers, and equips them to use the construct map as an important interpretative device.

Beyond the Wright Map, the analysis of the data will include steps that focus on
each of the items, as well as the set of items, including item analysis, item fit testing, and
overall fit testing, as well as analyses of validity and reliability evidence. These will be
discussed in detail in Chapters 6, 7 and 8. For now, in Chapter 1, the focus
will be on the Wright Map.

1.6.1 Example 1: The MoV Wright Map

Results from the analysis of data collected using the Data Modeling MoV
items were reported in Wilson & Lehrer (2021). The authors used a sample of 1002
middle school students from multiple school districts to calibrate these items.7
Specifically, they fitted a partial credit calibration model, a one-dimensional Rasch-
family item response model (Masters, 1982), to the responses related to the MoV
construct. (These models will be introduced and exemplified in Chapter 5 and following.)
This calibration model incorporates statistical parameters (so-called “Thurstone
thresholds”) that correspond to the differences between successive waypoints on the
construct map. For each item, they used the threshold values to describe the empirical
characteristics of the item (Adams et al, 2020; Wilson, 2005).

The way that an item is displayed on a Wright map is as follows:


(a) if an item has k score categories, then there are k-1 thresholds on the Wright map,
one for each transition between the categories;
(b) each item threshold gives the ability location (in logits8) that a student must obtain to
have a 50% chance of success at the associated scoring category or above,
compared to the categories below.
For example, suppose a fictitious “Item A” has three possible score categories (0, 1, and
2): In this case there will be two thresholds. This is illustrated on the right-hand side of
Figure 1.10. Suppose that the first threshold has a value of -1.0 logits: this means that a
student at that same location, -1.0 logits (shown as the “X” on the left-hand side in
Figure 1.10), has an equal chance of scoring in category 0 compared to the categories
above (categories 1 and 2). If their ability is lower than the threshold value (-1.0 logits),
then they have a higher probability of scoring in category 0; if their ability is higher than
-1.0, then they have a higher probability of scoring in either category 1 or 2 (than in 0).
These thresholds are, by definition, ordered: in the given example, the second threshold
value must be greater than -1.0—as shown, it is at 0.0 logits. Items may have more than
two thresholds or just one threshold (for example, dichotomous items such as traditional
multiple-choice items).
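
To make this idea of a threshold concrete, the following short sketch (written in Python; it is not code from BASS, ConQuest or TAM, and the step parameter values for “Item A” are hypothetical) computes Thurstone-type thresholds for a partial credit item by finding the ability at which the probability of scoring in a given category or above first reaches 50%.

    import math

    def pcm_category_probs(theta, deltas):
        """Category probabilities for a partial credit item with step parameters deltas."""
        cum = 0.0
        numerators = [1.0]               # score 0 has numerator exp(0) = 1
        for d in deltas:
            cum += theta - d             # cumulative sum of (theta - delta_j)
            numerators.append(math.exp(cum))
        total = sum(numerators)
        return [n / total for n in numerators]

    def prob_at_or_above(theta, deltas, k):
        """P(score >= k) for a respondent located at theta (in logits)."""
        return sum(pcm_category_probs(theta, deltas)[k:])

    def thurstone_threshold(deltas, k, lo=-10.0, hi=10.0, tol=1e-6):
        """Ability location at which P(score >= k) = 0.5, found by bisection."""
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if prob_at_or_above(mid, deltas, k) < 0.5:
                lo = mid                 # need a higher ability to reach a 50% chance
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Hypothetical step parameters for a three-category "Item A" (scores 0, 1, 2)
    deltas = [-1.2, 0.2]
    for k in (1, 2):
        print(f"Threshold {k - 1}/{k}: {thurstone_threshold(deltas, k):+.2f} logits")

A respondent located exactly at one of these thresholds has, by definition, a 50% chance of scoring in the associated category or above; locations above or below the threshold push that probability above or below 50%.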

Figure 1.10 Sketch of a Wright map.

7 In a series of analyses carried out before the one on which the following results are based, they investigated rater effects for the constructed response items, and no statistically significant rater effects were found, so these are not included in the analysis. But also see Section 3.6.
8 Logits are the logarithm of the odds—see Textbox 5.1 for ways to interpret logit units.

[Figure 1.10 shows two columns on a shared vertical logit scale running from -1 to +1: a “Respondents” column on the left, with a respondent “X” located at -1 logit, and an “Item-Thresholds” column on the right, with Item A’s Threshold 0/1 at -1 logit and Threshold 1/2 at 0 logits.]

The locations of the MoV item thresholds are graphically summarized in the
Wright map in Figure 1.11, simultaneously showing estimates for both the students and
items on the same (logit) scale. (The item threshold estimates are shown in Appendix
1B.) Moving across the columns from left to right on Figure 1.11, one can see the
following.
(a) The logit scale.
(b) A histogram (on its side) of the respondents’ estimated locations, including the
number of respondents represented by each bar of the histogram.
(c) The location of students at each raw score.
(d) A set of columns, one for each waypoint on the construct map9, with the labels for
	each waypoint printed at the bottom, such as “MoV1.” The location of the
	threshold for each item is represented by a pair of symbols, “i.k,” where “i”
	indicates the item number and “k” specifies the item score, so that, for example,
	“9.2” is the second threshold location for Item 9—that is, the threshold between
	the scores 0 and 1 compared to scores 2 to 4 (the maximum score for Item 9)10.
(e) A column indicating the bands for each of the waypoints in the construct map. Note
that the bands for MoV2 and MoV3 have been combined in the Wright map—see
the second paragraph below for a discussion of this.
(f) The logit scale (again, for convenience).

9 Recall that the thresholds are comparing lower categories to upper categories, and are labelled using the minimum upper category; thus the third label, “MoV1,” represents the thresholds between the lowest two categories (NL(i) and NL(ii)) and the categories above (MoV1 to MoV5). There will generally be one less of these columns than the number of construct level waypoints.
10 Note that an item with the label “i.k” will often be in the column for the (k+1)th waypoint, but this will not always be the case, as some items may be missing some waypoints. Indeed, this latter is the case for threshold location 9.2.

Note also that the legend for the item labels is shown at the bottom: so, for example, Item
9 is named as “Model2” (i.e., the second question in the Model task) in the legend.

To begin interpreting Figure 1.11, consider first the case for a single item, say, the
“Piano Width” task discussed above—consider specifically the item labelled as “Piano2.”
This item’s thresholds appear in the columns labelled NL(ii), MoV1 and MoV2&3—
corresponding to the item design, it is sensitive only to the bottom part of the construct
map, specifically the Waypoints from NL(i) through to MoV3. Consider a student located
at the same point as the MoV1 threshold for this item (0.34 logits)—the histogram bar
here is one of the largest, with 172 students at the same location. A student at this
location would be approximately 50% likely to respond at or above the MoV1 level on the
construct map (see Figure 1.4; i.e., to at least identify sources of variability).11

Looking beyond interpreting the findings for a single item, we need to investigate
the consistency of the locations of the thresholds across items. A standard-setting
procedure called construct mapping (Draney & Wilson, 2011) was used to develop
empirical boundaries between the sets of thresholds for each waypoint and thus create
interpretative “bands” on the Wright map. The bands are indicated by the horizontal lines
across Figure 1.11. The bands indicate that the thresholds fall quite consistently into
ordered sets, with a few exceptions, specifically the thresholds for Soil-NL(ii), Piano4-
MoV1, and Model3-MoV4. In the initial representations of this Wright map, it was found that
the thresholds for the Waypoints MoV2 and MoV3 were thoroughly mixed. A large
amount of time was spent exploring this, both quantitatively, using the data, and
qualitatively, examining item contents, and talking to curriculum developers and teachers
about the apparent anomaly. The conclusion was that, for these two Waypoints, although
there is certainly a necessary hierarchy to their lower ends (i.e., there is little hope of a
student successfully using a chance-based device such as a spinner to represent a source
of variability (MoV3) if they cannot informally describe such a source (MoV2)), these
two Waypoints can and do overlap quite a bit in the classroom context. Students are still
improving on MoV2 when they are initially starting on MoV3, and they continue to
improve on both at about the same time. Hence, while it was decided to
uphold the distinction between MoV2 and MoV3 in terms of content, it seemed best, at
least formally, to ignore the difference in difficulty between these waypoints, and to label
the relevant segment of the scale (i.e., the relevant band) as “MoV2&3.” Thus, the MoV construct map was
modified as in Figure 1.12.

11 Or for more detail, look at Table 1.1.

Figure 1.11 The Wright Map for MoV

As noted above, the conceptualization of a specific construct map starts off as an
idea mainly focused on the content of the underlying construct, and relates to any extant
literature and other content-related materials concerning that construct. But eventually,
after iteration through the cycle of the four building blocks, it will incorporate both
practical knowledge of how items are created and designed, and empirical
information about the behavior of the items (discussed in detail in Chapters 3
and 4), as well as about the behavior of the set of items as a whole (as represented in the
calibration results, especially the Wright map).

Figure 1.12 The Revised MoV construct map.

MoV5: Account for variability among different runs of model simulations to judge
adequacy of model.

MoV4: Develop emergent models of variability.

MoV3: Use a chance device to represent a source of variability or the total variability
of the system.
MoV2: Informally describe the contribution of one or more sources of variability to
variability observed in the system.

MoV1: Identify sources of variability.

The interpretive bands on the Wright map can thus be used as a means of
labelling estimated student locations with respect to the construct map Waypoints NL(i)
to MoV4. For example, a student estimated to be at 1.0 logits (there are 29 students
estimated to be at that point) could be interpreted as being at the point of most actively
learning (specifically, succeeding at the relevant points approximately 50% of the time)
within the construct map waypoints MoV2 and MoV3, that is, being able to informally
describe the contribution of one or more sources of variability to the observed variability
in the system, while at the same time developing a chance device (such as a spinner) to
represent that relationship. The same student would be expected to succeed more
consistently (approximately 75%) at MoV1 (i.e., being able to identify sources of
variability), and succeed much less often (approximately 25%) at MoV4 (i.e., develop an
emergent model of variability). Calculation of these probabilities depends on the
distance between the item and student locations in the logit metric, which is explained in
Chapter 5.
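
The following minimal sketch (again in Python, with illustrative threshold locations rather than the actual MoV estimates) shows how such probabilities follow from the logit distance between the student and a threshold: a distance of 0 logits corresponds to about a 50% chance, and distances of roughly +1.1 and -1.1 logits to approximately 75% and 25%, respectively.

    import math

    def success_probability(student_logit, threshold_logit):
        """Rasch-type probability of success, determined by the logit distance."""
        return 1.0 / (1.0 + math.exp(-(student_logit - threshold_logit)))

    student = 1.0   # the student at 1.0 logits discussed above
    # Illustrative (not estimated) threshold locations for the three bands mentioned in the text
    for waypoint, threshold in [("MoV1", -0.1), ("MoV2&3", 1.0), ("MoV4", 2.1)]:
        p = success_probability(student, threshold)
        print(f"{waypoint}: distance {student - threshold:+.1f} logits -> P(success) about {p:.2f}")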

1.6.2 Return to the Discussion of Causation and Inference
What are the directions of causation in the BAS?

Looking back to the discussion about the relationship between causation and
inference, previously illustrated in Figure 1.8, we can now elaborate that diagram by
plugging the four building blocks into their appropriate places (see Figure 1.13). In
this Figure, the arrow of causality goes directly from the construct to the item responses—
it does not go through the outcome space or the calibration model because (presumably)
the construct would have "caused" the responses whether or not the measurer had
constructed a scoring guide and a calibration model. This sometimes puzzles
people, but indeed it amply displays the distinction between the latent causal link and the
manifest inferential link. The initial, vague link (as in Figure 1.8) has been replaced in
Figure 1.13 by a causal link, and, in particular, the (undifferentiated) inference link in
Figure 1.8 has been populated by two important practical tools that we use in measuring
the construct, the outcome space and the calibration model.

Figure 1.13 The “4 building blocks” showing the directions of causality and
inference.

[Figure 1.13 shows the four building blocks arranged in a cycle, with the Construct Map and the Items Design in the top row under the label “Causality,” and the Calibration Model and the Outcome Space in the bottom row above the label “Inference.”]

1.7 Reporting the Results to the Measurer and Other Users
How can the results from the theoretical and empirical development be used to help
measurers interpret the measurements?

There are numerous ways to report the measurements using the approach
described in this chapter, and many are available within the Report generation options in
the BASS application. In this introductory chapter, only one will be featured, but more
will be shown in later chapters (especially Chapter 6).

A report called the “Group Proficiency Report” was generated in BASS for a
classroom set of students from the MoV data set and is shown in Figure 1.14. In this
graph, the MoV scale runs horizontally from left to right (i.e., lowest locations on the
left). The bands representing the waypoints are shown as vertical bars in different shades
of grey (blue in the on-line version), with the labels (NL(i) to MoV5) in the top row and
the logit estimates of the boundaries between them are indicated in the second row (and
also at the bottom). Below that, each individual student is represented in a row, with a
black dot showing their estimated location, and an indication of variation around that
given by the “wings” on either side of each dot. Looking at the dots, one can see that, for
this class, the students range broadly across the three Waypoints NL(ii), MoV1 and
MoV2&3. There are also two outliers,12 #13549 and #13555, who fall below and above
these three core Waypoints, respectively. Thus, a strategy might be envisaged where the teacher
would plan activities suitable for students at these three levels of sophistication in
understanding variation, and also plan to talk individually with the two outliers to see what
is best for each of them. The exact meaning and interpretation of the wings will be left
for Chapter 5 to explore in detail, but here we note that there is a group of students (#13553
to #13530) who are straddling MoV1 and MoV2&3, and this will also need to be
considered. These results are also available in a tabular format that can be viewed directly in
BASS and/or exported into a grading application.
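
As a rough indication of how such a display can be assembled outside of BASS, the sketch below (Python with matplotlib; the student locations, wings, and band boundaries are invented for illustration, although the two outlier identifiers are taken from the discussion above) draws a simplified group proficiency report: shaded vertical bands for the waypoints, and one row per student with a dot for the estimated location and error-bar “wings” on either side.

    import matplotlib.pyplot as plt

    # Hypothetical student location estimates (logits) and "wings" (+/- half-widths)
    students = {"13549": (-1.6, 0.5), "13531": (-0.2, 0.4), "13553": (0.3, 0.4),
                "13530": (0.6, 0.4), "13542": (1.1, 0.4), "13555": (2.4, 0.5)}
    # Hypothetical boundaries (logits) between adjacent waypoint bands
    bands = [("NL(i)", -3.0, -1.2), ("NL(ii)", -1.2, -0.3), ("MoV1", -0.3, 0.7),
             ("MoV2&3", 0.7, 1.9), ("MoV4", 1.9, 3.0)]

    fig, ax = plt.subplots(figsize=(8, 3))
    for i, (label, lo, hi) in enumerate(bands):
        ax.axvspan(lo, hi, color="tab:blue", alpha=0.10 + 0.08 * (i % 2))    # shaded waypoint bands
        ax.text((lo + hi) / 2.0, len(students) - 0.4, label, ha="center")    # waypoint labels (top row)
    for row, (sid, (est, wing)) in enumerate(students.items()):
        ax.errorbar(est, row, xerr=wing, fmt="o", color="black", capsize=3)  # dot plus "wings"
    ax.set_yticks(range(len(students)))
    ax.set_yticklabels([f"#{sid}" for sid in students])
    ax.set_xlabel("MoV scale (logits)")
    ax.set_ylim(-1, len(students))
    plt.tight_layout()
    plt.show()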

The other reports that are available in BASS are: (a) the Individual Proficiency
Report (analogous to the Group one, except generated for an individual student), (b) a
report of each student’s raw (unscored) responses to each item, (c) a report of the score
for each student’s responses to each item, and (d) a report on how consistently each
student responded, given their estimated location, which may help teachers understand
individual responses to the items (see Section 6.3.2).

12 Student names are avoided, for privacy.

Figure 1.14. A Group Proficiency Report for a set of Students on the MoV construct

1.8 Using the 4 Building Blocks to Develop an Instrument


How can the 4 Building Blocks function as a guide for instrument construction?

The account so far, although illustrated using the ADM example, has been quite
abstract. The reader should not be alarmed by this, as the next four chapters are devoted,
in turn, to each of the four building blocks, and will provide more detail and many
examples of each, across a broad range of contexts and subject matters. The purpose of
this introductory chapter has been simply to orient the reader to what is to come.

Another purpose of this chapter is to get the reader thinking and learning about
the practical process of instrument development. If the reader does indeed want to learn
to develop instruments, then it should be obvious that he or she should be prepared to
read through this section and carry out the exercises and class projects that are described
in the chapters that follow. However, even if practical experience about how to develop
instruments is not the aim of the reader, then this section, and later sections like it, should
still be studied carefully, and the exercises carried out fully. The reason for this is that
learning about measurement without developing an instrument leaves the reader in a very
incomplete state of knowledge—it’s a bit like trying to learn about bike riding, soufflé
cooking, or juggling, by reading about it in a book, or watching a video, without actually
trying it out. We all can appreciate that an essential part of the knowing in these
situations is the doing, and the same is true of measurement: it is sufficiently complicated
and is inherently based on the requirement to balance multiple competing optimality
considerations, so that genuine appreciation for how any principles or knowledge about it
operate can only be mastered through practice. The exercises at the end of each chapter
are intended to be a path towards this knowledge-in-practice—it can be demanding to
actually carry out some of the exercises and will certainly take more time than just

reading the book, but carrying them out will hopefully bring a sense of satisfaction
in its own right, and enrich the reader’s appreciation of the complexity of measurement.

The four building blocks provide not only a path for inference about a construct
but can also be used as a guide to the construction of an instrument to measure that
construct. The next four chapters are organized according to a development cycle based
on the 4 building blocks—see Figure 1.15. It will start with defining the idea of the
construct as embodied in the construct map (Chapter 2), then move on to developing
tasks and contexts that engage the construct, the items design (Chapter 3). These items
generate responses that are then categorized and scored—that is the outcome space
(Chapter 4). The calibration model is applied to analyzing the scored responses (Chapter
5), and these measures can then be used to reflect back on the success with which one has
measured the construct—bringing one back to the construct map (Chapter 2). In essence,
this sequence of building blocks is a cycle, one that may need to be repeated
several times. Three later chapters (6, 7 and 8) help with this appraisal process by
focusing on gathering evidence about how the instrument works: on model-fit, reliability
evidence and validity evidence, respectively.

Figure 1.15 The instrument development cycle through the 4 building blocks.

[Figure 1.15 shows the development cycle running from the Construct Map to the Items, then to the Item Scores, then to the Measurements, and back to the Construct Map.]

As the measurer starts down the path of developing the instrument, they will need
to gather some resources in order to get started. Even before they start developing the
construct map, the topic of Chapter 2, they should initialize two sorts of resources that
will provide continuing support throughout the whole exercise: a literature review and a
set of informants.

Literature review. Every new instrument (or, equally, the redevelopment or
adaptation of an old instrument) must start with an idea—the kernel of the instrument, the
“what” of the “what does it measure?” and the “how” of “how will the measurements be
used?” When this is first being considered, it makes a great deal of sense to look broadly
to establish a dense background of knowledge about the content and the uses of the
instrument. As with any new development, one important step is to investigate (a) the
theories behind the construct and (b) what has been done in the past to measure this

particular content—that is, what have been the characteristics of the instrumentation that
were used. The materials in this latter category may not be available in the usual places
that researchers/developers look for the relevant literature. Often, published documents
provide very few details about how previous instruments were developed,
especially any steps that did not work out (this is the measurement equivalent of the “file-
drawer problem” in meta-analysis). It may require contacting the authors of the previous
instruments to uncover materials such as these: the measurer is encouraged to try that out,
as it has been my experience that many, if not most, instrument developers will welcome
such contacts. Thus, a literature review is necessary, and should be completed before
going too far with other steps (say, alongside Chapter 2, but before commencing the
activities discussed in Chapter 3). However, a literature review will necessarily be
limited to the insights of those who had previously worked in this area, so other steps will
also have to be taken.

Informants. Right at the beginning, the measurer needs to recruit a small set of
informants to help with instrument design. This should include (a) some potential
respondents, where appropriate, who should be chosen to span the usual range of
respondents. Other members would include (b) professionals, teachers/academics and
researchers in the relevant areas, as well as (c) people knowledgeable about measurement
in general and/or measurement in the specific area of interest, and (d) people who are
knowledgeable and reflective about the area of interest and/or measurement in that area,
such as administrators and policy makers, etc. This group (which may change somewhat
in nature over the course of the instrument development) will, at this point, be helpful to
the measurer by discussing their experiences in the relevant area, by criticizing and
expanding on the measurer’s initial ideas, by serving as “guinea pigs” in responding to
older instruments in the area, and by also responding to initial items and item designs.
The information from the informants should overlap that from the literature review but
may also contradict it in parts.

1.9 Resources
History of the Four Building Blocks/BEAR Assessment System approach. For a
very influential perspective on the idea of a construct, see the seminal article by Messick
(1989) referenced earlier. The conceptual link between the construct and the calibration
model was made explicit in two books by Benjamin Wright, which are also seminal for
the approach taken in this book: Wright and Stone (1979) and Wright and Masters
(1981). The origin of the term “Wright map” is discussed in Wilson (2017). The idea of
the construct map was introduced in Wilson & Sloane (2000).

Similar Ideas. The idea of the evidence-centered design approach to assessment is
quite parallel—an early account is given in Mislevy, Steinberg and Almond (2003), and
an integrative account is given in Mislevy et al. (2003). A closely-related
approach is termed “Developmental Assessment” by Geoff Masters and his colleagues at
the Australian Council for Educational Research—examples are given in DEETYA
(1997) and Masters & Forster (1996). This is also the basis of the historical approach
taken by the OECD’s PISA project (OECD, 1999), where the equivalent of the construct
map is referred to as the “described variable.” The BAS can be seen as falling into the

category of principled assessment design (PAD), and this general approach, as well as
several other examples, has been summarized in Ferrara et al. (2016) and Wilson & Tan
(in press).

Aspects and applications of the Four Building Blocks/BEAR Assessment System
approach. The BEAR Assessment System (BAS; Wilson & Sloane, 2000), which is based
on the four building blocks, has been used in other contexts besides the ADM assessment
example given above, which was originally published in Lehrer et al. (2018). Other
publications about the ADM context are Schwartz et al. (2018) and Wilson & Lehrer
(2021). There are many publications about aspects of the BAS giving examples of
construct maps and the BAS across both achievement and attitude domains. A list of
them is given in Appendix 3.

1.10 Exercises and Activities


1. Explain what your instrument will be used for, and why existing instruments will not
suffice.

2. Read about the theoretical background to your construct. Write a summary of the
relevant theory (keep it relatively brief, no more than 5 pages).

3. Investigate previous efforts to develop and use instruments with a similar purpose, and
ones with related, but different, purposes. In many areas there are compendia of such
efforts—for example, in the areas of psychological and educational testing there are
series like the Mental Measurements Yearbook (Carlson et al., 2021)—similar
publications exist in many other areas. Write a summary of the alternatives that are
found, outlining the main points, perhaps in a table (again, keep it brief, no more than 5
pages).

4. Brainstorm possible informants for your instrument construction. Contact several
potential informants and discuss your plans with them—secure the agreement of some of
them to help you out as you make progress.

5. Try to think through the steps outlined above in the context of developing your
instrument, and write down notes about your plans, including a draft timetable. Try to
predict problems that you might encounter as you carry out these steps.

6. Share your plans and progress with others who are engaged in similar efforts (or who
have done so)—discuss what you and they are succeeding on, and what problems have
arisen.

7. Read through Appendix B about the BAS Software (BASS). Make sure you can
access the BASS website. Look around on the website and explore the resources and
materials there.

8. Log into BASS, and explore the Examples included. Choose one you are interested in
and look through the screens under the Construct, Items and Scoring Guide tabs, and
explore the results under the Analysis and Reports tabs.

9. If you haven’t already chosen the MoV Example, repeat #8 for that one, and compare
what you find with the tables, figures and results reported in this chapter.

Appendix 1A

The MoV Outcome Space

Appendix 1B
The MoV Item Difficulty Estimates

Table 1B.1 Item Difficulties for the MoV Items

Item Number   Item Name   Waypoint Category   Item Difficulty1
2 Building2 1 -0.88
2 Building2 2 0.02
2 Building2 3 1.58
3 Building4 1 -0.60
3 Building4 2 0.23
3 Building4 3 0.68
6 Model1 1 -0.98
6 Model1 2 0.92
6 Model1 3 3.02
7 Model2 1 -0.75
7 Model2 2 0.84
7 Model2 3 1.69
7 Model2 4 3.06
8 Model3 1 -0.93
8 Model3 2 1.05
8 Model3 3 1.48
9 Piano2 1 -0.77
9 Piano2 2 0.34
9 Piano2 3 1.36
10 Piano4 1 -0.54
10 Piano4 2 -0.07
10 Piano4 3 0.21
11 Soil2 1 0.44
11 Soil2 2 0.75
12 Rock2 1 -0.26
12 Rock2 2 0.83
1 In logits.

Textbox 1.1
Some Useful Terminology

In this volume, the word instrument is defined as a technique of relating
something we observe in the real world (sometimes called “manifest” or “observed”) to
an attribute that we are measuring that exists only as a part of a theory (sometimes called
“latent” or “unobserved”). This is somewhat broader than the typical usage, which
focuses on the most concrete manifestation of the instrument—the items or questions.
This broader definition has been chosen to expose the less obvious aspects of
measurement. Examples of the kinds of instruments that can be subsumed under the
“construct mapping” framework are shown in this and the next several chapters. Very
generally, it will be assumed that there is a respondent who is the object of measurement
—sometimes the label will be changed depending on an application context, for example
a subject in a psychological context, a student or examinee in education, a patient in a
health context. Also, very generally, it will be assumed that there is a measurer who
seeks to measure something about the respondent; and when the instrument is being
developed, this may be made more specific by referring to the instrument developer.
While reading the text the reader should mainly see himself or herself as the measurer
and/or the instrument developer, but it is always useful to be able to assume the role of
the respondent as well.

