Research timeline: Assessing Second Language Speaking
Glenn Fulcher
University of Leicester, United Kingdom
School of Education
Biodata: Glenn Fulcher is Professor of Education and Language Assessment at the University of
Leicester, and Head of the School of Education. He has published widely in the field of language
testing, from journals such as Language Testing, Language Assessment Quarterly, Applied Linguistics
and System, to monographs and edited volumes. His books include Testing second language speaking
(Longman, 2003), Language testing and assessment: An advanced resource book (Routledge 2007),
Practical language testing (Hodder 2010), and the Routledge handbook of language testing
(Routledge 2012). He currently co-edits the Sage journal Language Testing.
Introduction
While the viva voce (oral) examination has always been used in content-based educational assessment
(Latham 1877, p. 132), the assessment of second language speaking in performance tests is relatively
recent. The impetus for the growth in testing speaking during the 19th and 20th Centuries is twofold.
Firstly, in educational settings the development of rating scales was driven by the need to improve
achievement in public schools, and to communicate that improvement to the outside world. Chadwick
(1864) implies that the rating scales first devised in the 1830s served two purposes: providing
information to the classroom teacher on learner progress for formative use, and generating data for
school accountability. From the earliest days, such data was used by parents to select schools for their
children in order to ‘maximize the benefit of their investment’ (Chadwick 1858). Secondly, in military
settings it was imperative to be able to predict which soldiers were able to undertake tasks in the field
without risk to themselves or other personnel (Kaulfers 1944). Many of the key developments in
speaking test design and rating scales are linked to military needs.
The speaking assessment project is therefore primarily a practical one. The need for speaking tests has
expanded from the educational and military domain to decision making for international mobility,
entrance to higher education, and employment. But investigating how we make sound decisions based
on inferences from speaking test scores remains the central concern of research. A model of speaking
test performance is essential in this context, as it helps focus attention on facets of the testing context
under investigation. The first such model, developed by Kenyon (1992), was subsequently extended by
McNamara (1995), Milanovic & Saville (1996), Skehan (2001), Bachman (2001), and most recently
by Fulcher (2003, p. 115), providing a framework within which research might be structured. The
latter is reproduced here to indicate the extensive range of factors that have been and continue to be
investigated in speaking assessment research, and these are reflected in my selection of themes and
associated papers for this timeline.
Figure 1. An expanded model of speaking test performance (Fulcher 2003, p. 115).
[Figure 1 links the rater(s) (characteristics, training), the rating scale and band descriptors (orientation, scoring philosophy, construct definition), the score and the inferences made about the test taker, local performance conditions, the interlocutor(s), the task (orientation, interactional relationship, goals, interlocutors, topics, situations, difficulty), additional task characteristics or conditions required for specific contexts, the test taker (individual variables such as personality, task specific knowledge or skills, real-time processing capacity, and abilities/capacities on constructs), and the decisions and consequences that follow from scores.]
Overviews of the issues illustrated in figure 1 are discussed in a number of texts devoted to assessing
speaking that I have not included in the timeline (Fulcher 2003; Lazaraton 2002; Luoma 2004; Taylor (ed.) 2011). Rather, I have selected publications based on 12 themes that arise from these texts, from
figure 1, and from my analysis of the literature.
Themes that pervade the research literature are rating scale development, construct definition,
operationalisation, and validation. Scale development and construct definition are inextricably bound
together because it is the rating scale descriptors that define the construct. Yet, rating scales are
developed in a number of different ways. The data-based approach requires detailed analysis of
performance. Others are informed by the views of expert judges using performance samples to describe
levels. Some scales are a patchwork quilt created by bundling descriptors from other scales together
based on scaled teacher judgments. How we define the speaking construct and how we design the
rating scale descriptors are therefore interconnected. Design decisions thus need to be informed
by testing purpose and relevant theoretical frameworks.
Underlying design decisions are research issues that are extremely contentious. Perhaps these can be
presented in a series of binary alternatives to show stark contrasts, although in reality there are clines
at work.
Specific Purposes Tests vs. Generalizability. Should the construct definition and task design be related
to specific communicative purposes and domains? Or is it possible to produce test scores that are
relevant to any and every type of real-world decision that we may wish to make? This is critical not
least because the more generalizable we wish scores to be, the more difficult it becomes to select test
content.
Psycholinguistic Criteria vs. Sociolinguistic Criteria. Closely related to the specific purpose issue is
the selection of scoring criteria. Usually, the more abstract or psycholinguistic the criteria used, the
greater the claims made for generalizability. These criteria or ‘facilities’ are said to be part of the
construct of speaking that is not context dependent. These may be the more traditional constructs of
‘fluency’ or ‘accuracy’, or more basic observable variables related to automaticity of language
processing, such as response latency or speed of delivery. The latter are required for the automated
assessment of speaking. Yet, as the generalizability claim grows, the relationship between score and
any specific language use context is eroded. This particular antithesis is not only a research issue, but
one that impacts upon the commercial viability of tests; it is therefore not surprising that from time to
time the arguments flare up, and research is called into the service of confirmatory defence (Chun
2006; Downey et al. 2008).
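
To make ‘observable variables’ concrete, the following minimal sketch (my illustration only; the data structure, the 0.25-second pause threshold and the variable names are assumptions, not the scoring algorithm of any operational test) shows how two automaticity measures of the kind mentioned above, speech rate and mean pause length, might be computed from word-level timestamps of the sort produced by a speech recognizer:

# Illustrative sketch only: simple automaticity measures (speech rate, pausing)
# computed from hypothetical word-level timestamps. Not the scoring algorithm
# of any operational speaking test.
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the response
    end: float

def speech_rate(words: List[Word]) -> float:
    """Words per minute over the spoken portion of the response."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    return len(words) / duration * 60 if duration > 0 else 0.0

def mean_pause_length(words: List[Word], min_pause: float = 0.25) -> float:
    """Mean length (seconds) of inter-word silences longer than min_pause."""
    pauses = [nxt.start - cur.end
              for cur, nxt in zip(words, words[1:])
              if nxt.start - cur.end >= min_pause]
    return sum(pauses) / len(pauses) if pauses else 0.0

response = [Word("the", 0.50, 0.62), Word("library", 0.65, 1.10),
            Word("opens", 1.80, 2.15), Word("at", 2.20, 2.30),
            Word("nine", 2.33, 2.70)]
print(f"speech rate: {speech_rate(response):.1f} words/min")  # roughly 136 words/min
print(f"mean pause: {mean_pause_length(response):.2f} s")     # 0.70 s

An automated scoring engine feeds measures of this kind, rather than a rater’s judgment of the discourse, into its score model; hence the trade-off with context-specific relevance discussed above.
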
Normal Conversation vs. Domain Specific Interaction. It is widely claimed that the ‘gold standard’ of
spoken language is ‘normal’ conversation, loosely defined as interactions in which there are no power
differentials, so that all participants have equal speaking rights. Other types of interaction are
compared to this ‘norm’ and the validity of test formats such as the interview is brought into
question (e.g. Johnson 2001). But we must question whether ‘friends chatting’ is indeed the ‘norm’ in
most spoken interaction. In higher education, for example, this kind of talk is very rare, and scores
from simulated ‘normal’ conversations are unlikely to be relevant to communication with a professor,
accommodation staff, or library assistants. Research that describes the language used in specific
communicative contexts to support test design is becoming more common, such as that in academic
contexts to underpin task design (Biber 2006).
Rater Cognition vs. Performance Analysis. It has become increasingly common to look at ‘what raters
pay attention to’. When we discover what is going on in their heads, should it be treated as construct
irrelevant if it is at odds with the rating scale descriptors and/or an analysis of performance on test
tasks? Or should it be used to define the construct and populate the rating scale descriptors? Do all
raters bring the same analysis of performance to the task? Or are we merely incorporating variable
degrees of perverseness that dilute the construct? The most challenging question is perhaps: Are rater
perceptions at odds with reality?
Freedom vs. Control. Left to their own devices, raters tend to vary in how they score the same
performance. The variability decreases if they are trained; and it decreases over time through the
process of social moderation. With repeated practice raters start to interpret performances in the same
way as their peers. But when severed from the collective for a period of time, judges begin to reassert
their own individuality, and disagreement rises. How do we identify and control this variability? This
question now extends to interlocutor behaviour, as we know that interlocutors provide differing levels
of scaffolding and support to test takers. This variability may lead to different scores for the same test
taker depending on which interlocutor they work with. Much work has been done on the co-construction of speech in test contexts. And here comes the crunch. For some, this variation is part of
a richer speaking construct and should therefore be built into the test. For others, the variation
removes the principle of equality of experience and opportunity at the moment of testing, and
therefore the interlocutors should be controlled in what they say. In face-to-face speaking tests we
have seen the growth of the interlocutor frame to control speakers, and proponents of indirect
speaking tests claim that the removal of an interlocutor eliminates subjective variation.
Publications selected to illustrate a timeline are inevitably subjective to some degree, and the list
cannot be exhaustive. My selection avoids clustering in particular years or decades, and attempts to
show how the contrasts and themes identified play out historically. You will notice that themes H and
I are different from the others in that they are about particular methodologies. I have included these
because of their pervasiveness in speaking assessment research, and because they may help others to identify key discourse or multi-faceted Rasch measurement (MFRM) studies. What I have not been able to cover
is the assessment of pronunciation and intonation, or the detailed issues surrounding semi-direct (or
simulated) tests of speaking, both of which require separate timelines. Finally, I am very much aware
that the assessment of speaking was common in the United Kingdom from the early 20th Century. Yet,
there is sparse reference to research outside the United States in the early part of the timeline. The reason for this is that, apart from Roach (1945, reprinted as a facsimile appendix in Weir, Vidaković & Galaczi (eds.) 2013), there is very little published research from Europe (Fulcher 2003, p. 1). The
requirement that research is in the public domain for independent inspection and critique was a
criterion for selection in this timeline. For a retrospective interpretation of the early period in the
United Kingdom with reference to unpublished material and confidential internal examination board
reports to which we do not have access, see Weir & Milanovic (2003) and Vidaković & Galaczi
(2013).
Themes
A. Rating scale development
B. Construct definition and validation
C. Task design and format
D. Specific purposes testing and generalizability
E. Reliability and rater training
F. The native speaker criterion
G. Washback
H. Discourse analysis
I. Multi-faceted Rasch Measurement (MFRM)
J. Interlocutor behaviour and training
K. Rater cognition
L. Test-taker characteristics
References
Bachman, L. F. (2001). Speaking as a realization of communicative competence. Paper presented at
the meeting of the American Association of Applied Linguistics. St. Louis, Missouri, February.
Biber, D. (2006). University language. A corpus-based study of spoken and written registers.
Amsterdam: John Benjamins.
Chadwick, E. (1858). On the economical, social, educational, and political influences of competitive
examinations, as tests of qualifications for admission to the junior appointments in the public service.
Journal of the Statistical Society of London 21.1, 18 – 51.
Chadwick, E. (1864). Statistics of educational results. Museum: A Quarterly Magazine of Education,
Literature and Science 3, 479-484.
Chun, C. W. (2006). Commentary: An analysis of a language test for employment: The authenticity of the PhonePass Test. Language Assessment Quarterly 3.3, 295 – 306.
Downey, R., Farhady, H., Present-Thomas, R., Suzuki, M. & Van Moere, A. (2008). Evaluation of the usefulness of the Versant for English Test: A response. Language Assessment Quarterly 5.2, 160 – 167.
Fulcher, G. (2003). Testing second language speaking. Harlow: Longman/Pearson Education.
Johnson, M. (2001). The Art of Non-conversation. A re-examination of the validity of the Oral
Proficiency Interview. New Haven and London: Yale University Press.
Kaulfers, W. V. (1944). War-time developments in modern language achievement tests. Modern
Language Journal 28, 136 – 150.
Kenyon, D. (1992). Introductory remarks at symposium on development and use of rating scales in
language testing. Paper delivered at the 14th Language Testing Research Colloquium, Vancouver,
March.
Latham, H. (1877). On the action of examinations considered as a means of selection. Cambridge:
Dighton, Bell and Company.
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge:
Cambridge University Press.
Luoma, S. (2004). Assessing second language speaking. Cambridge: Cambridge University Press.
McNamara, T. F. (1995). Modelling performance: Opening Pandora’s Box. Applied Linguistics 16.2,
159 – 179.
Milanovic, M. & Saville, N. (1996). Introduction. In Milanovic, M. (ed.), Performance testing,
cognition and assessment (pp. 1 – 17). Cambridge: Cambridge University Press.
Skehan, P. (2001). Tasks and language performance assessment. In Bygate, M., Skehan, P. & Swain,
M. (eds.), Researching pedagogic tasks: Second language learning, teaching and testing. (pp. 167 –
185). London: Longman.
Taylor, L. (ed.) (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge: Cambridge University Press.
Vidaković, I. & Galaczi, E. D. (2013). The measurement of speaking ability 1913 – 2012. In Weir, C. J., Vidaković, I. & Galaczi, E. D. (eds.), Measured constructs: A history of Cambridge English language examinations 1913 – 2012. Cambridge: Cambridge University Press.
Weir, C. & Milanovic, M. (eds.) (2003). Continuity and innovation: Revising the Cambridge Proficiency in English Examination 1913 – 2002. Cambridge: Cambridge University Press.
Weir, C. J., Vidaković, I. & Galaczi, E. D. (eds.) (2013). Measured constructs: A history of Cambridge English language examinations 1913 – 2012. Cambridge: Cambridge University Press.
Timeline
Each entry below gives the year, the reference(s), an annotation, and the relevant theme(s) from the list above.

1864 (Themes: A)
Chadwick, E. (1864). Statistics of educational results. Museum: A Quarterly Magazine of Education, Literature and Science 3, 479 – 484. Also see discussion in: Cadenhead, K. & Robinson, R. (1987). Fisher’s ‘Scale Book’: An early attempt at educational measurement. Educational Measurement: Issues and Practice 6.4, 15 – 18.
Annotation: The earliest record of an attempt to assess second language speaking dates to the first few years after Rev. George Fisher became Headmaster of the Greenwich Royal Hospital School in 1834. In order to improve and record academic achievement, he instituted a ‘Scale Book’, which recorded performance on a scale of 1 to 5 with quarter intervals. A scale was created for French as a second language, with typical speaking prompts to which boys would be expected to respond at each level. The Scale Book has not survived.

1912 (Themes: A, B)
Thorndike, E. L. (1912). The measurement of educational products. The School Review 20.5, 289 – 299.
Annotation: Scales of various kinds were developed by social scientists like Galton and Cattell towards the end of the 19th Century, but it was not until the work of Thorndike in the early 20th Century that the definition of each point on an equal interval scale was revived. With reference to speaking German, he suggested that performance samples should be attached to each level of a scale, along with a descriptor that summarizes the ability being tested.

1920 (Themes: A, B, C, D)
Yerkes, R. M. (1920). What psychology contributed to the war. In R. M. Yerkes (ed.), The new world of science: Its development during the war. New York, NY: The Century Co, 364 – 389. Also see discussion in: Fulcher, G. (2012). Scoring performance tests. In Fulcher, G. & Davidson, F. (eds.), The Routledge handbook of language testing. London and New York: Routledge, 378 – 392.
Annotation: Yerkes describes the development of the first large-scale speaking test for military purposes in 1917. It was designed to place army recruits into language development battalions. It consisted of a verbal section and a performance section (following instructions), with tasks linked to scale level by difficulty. Although the development of the test is not described in detail, the generic approach is outlined, and involved the identification of typical tasks from the military domain that were piloted in test conditions. It is arguably the case that this was the first English for Specific Purposes test based on domain specific criteria. In addition, there was clearly an element of domain analysis to support criterion-referenced assessment.

1944 (Themes: A, B, D)
Kaulfers, W. V. (1944). War-time developments in modern language achievement tests. Modern Language Journal 28, 136 – 150. Also see discussion in: Velleman, B. L. (2008). The ‘scientific linguist’ goes to war: the United States A.S.T. program in foreign languages. Historiographia Linguistica 35, 385 – 416.
Annotation: The interwar years saw a rapid growth in large-scale assessment that relied on the multiple-choice item for efficiency. In the Second World War Kaulfers quickly realized that these tests could not adequately predict ability to speak in potentially life-threatening contexts. Teaching and assessment of speaking was quickly geared towards the military context once again. Kaulfers presents scoring criteria according to the scope and quality of performance. However, all descriptors are generic and not domain specific.

1945 (Themes: E)
Roach, J. O. (1945). Some problems of oral examinations in modern languages. An experimental approach based on the Cambridge examinations in English for Foreign Students. University of Cambridge Examinations Syndicate: Internal report circulated to oral examiners and local representatives for these examinations. (Reprinted as facsimile in Weir et al. 2013.)
Annotation: Roach was among the first to investigate rater reliability in speaking tests. He was concerned primarily with maintaining ‘standards’, by which he meant that examiners would agree on which test takers were awarded a pass, a good pass, and a very good pass, on the Certificate of Proficiency in English. He was the first to recommend what we now call ‘social moderation’ (see MISLEVY 1992) – familiarization with the system through team work, which results in agreement evolving over time.

1952/1958 (Themes: A, B, C, D, F)
Foreign Service Institute (1952/1958). FSI Proficiency Ratings. Washington D.C.: Foreign Service Institute. Also see discussion in: Sollenberger, H. E. (1978). Development and current use of the FSI oral interview test. In Clark, J. L. D. (ed.), Direct testing of speaking proficiency: Theory and application. Princeton, NJ: Educational Testing Service, 1 – 12.
Annotation: Little progress was made in testing second language speaking until the outbreak of the Korean War in 1950. The Foreign Service Institute (FSI) was established, and the first widely used semantic-differential rating scale put into use in 1952. This operationalized the ‘native speaker’ construct at the top band (level six). With the Vietnam war on the horizon, a decision was taken to register the language skills of US diplomatic and military personnel. Work began to expand the FSI scale by adding verbal descriptors at each of the six levels from zero proficiency to native speaker, and to include multiple holistic traits. This went hand in hand with the creation of the Oral Proficiency Interview (OPI), which was a mix of interview, prepared dialogue, and simulation. The wording of the 1958 FSI scale and the tasks associated with the OPI have been copied into many other testing systems still in use.

1967 (Themes: E, G)
Carroll, J. B. (1967). The foreign language attainments of language majors in the senior year: A survey conducted in US colleges and universities. Foreign Language Annals 1.2, 131 – 151.
Annotation: Despite little validation evidence the FSI/ILR approach became popular in education because of its face validity, inter-rater reliability through social moderation, and perceived coherence with new communicative teaching methods. Carroll’s study of 1967 showed that the military system was not sensitive to language acquisition in an educational context, and hence was demotivating. It would be over a decade before this research had an impact on policy.

1979
Strength Through Wisdom: A Critique of U.S. Capability. A Report to the President from the President's Commission on Foreign Language and International Studies (1979). Washington DC: US Government Printing Office.
Annotation: Further impetus to extend speaking assessment in educational settings came from a report submitted to President Carter on shortcomings in the US military because of lack of foreign language skills. It is not coincidental that in the same year attention was drawn to a study published by Carroll in 1967. The American Council on the Teaching of Foreign Languages (ACTFL) was given the task of revising the FSI/ILR scales for wider use.

1979 (Themes: A, C, E, G)
Adams, M. L. & Frith, J. R. (1979). Testing kit: French and Spanish. Washington DC: Department of State and the Foreign Service Institute.
Annotation: As part of the ACTFL research into new rating scales the first testing kits were developed for training and assessment purposes in US colleges. The articles and resources in Adams & Frith provided a comprehensive guide for raters of the Oral Proficiency Interview for educational purposes.

1980 (Themes: B)
Adams, M. L. (1980). Five co-occurring factors in speaking proficiency. In Frith, J. R. (ed.), Measuring spoken language proficiency. Washington DC: Georgetown University Press, 1 – 6.
Annotation: Adams conducted the first structural validation study designed to investigate which of the five FSI subscales discriminated between learners at each proficiency level. The study was not theoretically motivated, and no patterns could be discerned in the data.

1980 (Themes: C)
Reves, T. (1980). The group-oral test: an experiment. English Teachers Journal 24, 19 – 21.
Annotation: Reves questioned whether the OPI could generate ‘real-life conversation’ and began experimenting with group tasks to generate richer speaking samples.

1981 (Themes: B)
Bachman, L. F. & Palmer, A. S. (1981). The construct validity of the FSI oral interview. Language Learning 31.1, 67 – 86.
Annotation: The first construct validation studies were carried out in the early 1980s, using the multitrait-multimethod technique and confirmatory factor analysis. These demonstrated that the FSI OPI loaded most heavily on the speaking trait, and lowest of all methods on the method trait. These studies concluded that there was significant convergent and divergent evidence for construct validity in the OPI.

1983 (Themes: A, C, D)
Lowe, P. (1983). The ILR oral interview: origins, applications, pitfalls, and implications. Die Unterrichtspraxis 16, 230 – 244.
Annotation: In the 1960s the FSI approach to assessing speaking was adopted by the Defense Language Institute, the Central Intelligence Agency, and the Peace Corps. In 1968 the various adaptations were standardized as the Interagency Language Roundtable (ILR), which is still the accepted tool for the certification of second language speaking proficiency throughout the United States military, intelligence and diplomatic services (http://www.govtilr.org/). Via the Peace Corps it spread to academia, and the assessment of speaking proficiency worldwide. It also provides the basis for the current NATO language standards, known as STANAG 6001.

1984 (Themes: A, B)
Liskin-Gasparro, J. E. (1984). The ACTFL Proficiency Guidelines: Gateway to testing and curriculum. Foreign Language Annals 17.5, 475 – 489.
Annotation: Following the publication of Strength Through Wisdom and the concerns raised by Carroll’s 1967 study, the ACTFL Guidelines were developed throughout the 80s, with preliminary publications in 1982, and the final Guidelines issued in 1986 (revised 1999). Levels from 0 to 5 were broken down into subsections, with finer gradations at lower proficiency levels. Level descriptors provided longer prose definitions of what could be done at each level. New constructs were introduced at each level, drawing on new theoretical models of communicative competence of the time, particularly those of Canale and Swain. These included discourse competence, interaction, and communicative strategies.

1985 (Themes: A, B)
Lantolf, J. P. & Frawley, W. (1985). Oral proficiency testing: A critical analysis. Modern Language Journal 69.4, 337 – 345.
Annotation: Lantolf and Frawley were among the first to question the ACTFL approach. They claimed the scales were ‘analytical’ rather than ‘empirical’, depending on their own internal logic of non-contradiction between levels. The claim that the descriptors bear no relationship to how language is acquired or used set off a whole chain of research into scale analysis and development.

1986 (Themes: B)
Kramsch, C. J. (1986). From language proficiency to interactional competence. Modern Language Journal 70.4, 366 – 372.
Annotation: Kramsch’s research into interactional competence spurred further research into task types that might elicit interaction, and the construction of ‘interaction’ descriptors for rating scales. This research had a particular impact on future discourse related studies by YOUNG & HE (1998).

1986 (Themes: B, D, F)
Bachman, L. F. & Savignon, S. (1986). The evaluation of communicative language proficiency: a critique of the ACTFL Oral Interview. Modern Language Journal 70.4, 380 – 390.
Annotation: This very influential paper questioned, firstly, the use of the native speaker to define the top level of a rating scale, and the notion of zero proficiency at the bottom. Secondly, it questioned reference to context within scales as confounding constructs with test method facets, unless the test is for a defined ESP setting. This paper therefore set the agenda for debates around score generalizability, which we still wrestle with today.

1987 (Themes: A, B, H)
Fulcher, G. (1987). Tests of oral performance: the need for data-based criteria. English Language Teaching Journal 41.4, 287 – 291.
Annotation: Using discourse analysis of native speaker interaction, this paper provided the first evidence that rating scales did not describe what typically happened in naturally occurring speech, and advocated a data-based approach to writing descriptors and constructing scales. This was the first use of discourse analysis to understand under-specification in rating scale descriptors, and was expanded into a larger research agenda (see FULCHER 1996).

1989 (Themes: B, H)
Van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency interviews as conversation. TESOL Quarterly 23.3, 489 – 508.
Annotation: In another discourse analysis study, Van Lier showed that interview language was not like ‘normal conversation’. Although the work of finding formats that encouraged ‘conversation’ had started with REVES (1980) and colleagues in Israel, this paper encouraged wider research in the area.

1991 (Themes: E, I)
Linacre, J. M. (1991). FACETS computer programme for many-faceted Rasch measurement. Chicago, IL: Mesa Press.
Annotation: Rater variation had been a concern since the work of Roach during the war, but only with the publication of Linacre’s FACETS did it become possible to model rater harshness/leniency in relation to task difficulty and learner ability. MFRM remains the standard tool for studying rater behaviour and test facets today, as in the studies by LUMLEY & MCNAMARA (1995) and BONK & OCKEY (2003).

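A note for readers new to MFRM (my gloss, not part of the original annotation): the rating-scale form of the many-facet Rasch model that FACETS estimates is commonly written as

\[ \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k \]

where P_nijk is the probability that test taker n receives category k from rater j on task i, B_n is the test taker’s ability, D_i the task’s difficulty, C_j the rater’s severity (harshness), and F_k the difficulty of being awarded category k rather than k-1. Because rater severity is estimated on the same logit scale as ability and task difficulty, the studies cited in this timeline can quantify, and adjust scores for, rater effects.
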
1991 (Themes: A)
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (eds.), Language Testing in the 1990s. London: Modern English Publications and the British Council, 71 – 86.
Annotation: Based on research driving the IELTS revision project, Alderson categorized rating scales as user-oriented, rater-oriented, and constructor-oriented. These categories have been useful in guiding descriptor content with audience in mind.

1992 (Themes: B, C, H, L)
Young, R. & Milanovic, M. (1992). Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14.4, 403 – 424.
Annotation: An early and significant use of discourse analysis to characterize the interaction of test takers with interviewers in the First Certificate Test of English. Discourse structure was demonstrated to be related to examiner, task and gender variables.

1992 (Themes: A, B, D)
Douglas, D. & Selinker, L. (1992). Analyzing Oral Proficiency Test performance in general and specific purpose contexts. System 20.3, 317 – 328.
Annotation: Douglas & Selinker show that a discipline specific test (chemistry) is a better predictor of domain specific performance than a general speaking test. In this and a series of publications on ESP testing they show that reducing generalizability by introducing context increases score usefulness. This is the other side of the coin to BACHMAN & SAVIGNON’S (1986) generalizability argument.

1992 (Themes: B, C, H, J)
Ross, S. & Berwick, R. (1992). The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14.1, 159 – 176.
Annotation: Reacting to critiques of the OPI from VAN LIER (1989), LANTOLF & FRAWLEY (1985; 1988), and others, Ross & Berwick undertook discourse analysis of OPIs to study how interviewers accommodated to the discourse of candidates. They concluded that the OPI had features of both interview and conversation. However, it also raised the question of how interlocutor variation might result in test takers being treated differentially. This sparked a chain of similar research by scholars such as LAZARATON (1996).

1992 (Themes: E)
Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods and prospects. Princeton, NJ: Educational Testing Service.
Annotation: LOWE (1983; 1987) and others had argued that the meaning of descriptors was socially acquired. In this publication the term ‘social moderation’ was formalized. NORTH (1998) and the Council of Europe have taken this concept and made it central to the project of using the Common European Framework of Reference (CEFR) scales as a European-wide lens for viewing speaking proficiency.

1995 (Themes: A, B, E)
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing 12.1, 16 – 33.
Annotation: Chalhoub-Deville investigated the inter-relationship of diverse tasks and raters using multidimensional scaling to identify the components of speaking proficiency that were being assessed. She found that these varied by task and rater group, and therefore called for the construct to be defined anew for each task x rater combination. The issue at stake is whether the construct ‘exists’ separately from those who make judgments and the facets of the test method.

1995 (Themes: E, I)
Lumley, T. & McNamara, T. (1995). Rater characteristics and rater bias: implications for training. Language Testing 12.1, 54 – 71.
Annotation: Rater variability is studied across time using FACETS, showing that there is considerable variation in harshness irrespective of training. The researchers question the use of single ratings in high-stakes speaking tests, and recommend the use of rater calibrations to provide training feedback or adjust scores.

1995 (Themes: A, B, C, D, K)
Upshur, J. & Turner, C. (1995). Constructing rating scales for second language tests. English Language Teaching Journal 49.1, 3 – 12.
Annotation: The paper in which Upshur & Turner introduce Empirically-derived binary-choice boundary-definition scales (EBB). These address the long-standing concern over a-priori scale development outlined by LANTOLF & FRAWLEY (1985), and start to tie decisions to specific examples of performance as recommended by FULCHER (1987). The scales are task specific rather than generic. The methodology has specific impact on later studies like those of POONPON (2010).

1996 (Themes: A, B, C, D)
McNamara, T. (1996). Measuring second language performance. Harlow: Longman.
Annotation: The research around the development of the Occupational English Test (OET) for health professionals is described. This is a specific purpose test with a clearly specified audience, and scores from this instrument are shown to be more reliable and valid for decision making than generic English tests.

1996 (Themes: C, G)
Fulcher, G. (1996). Testing tasks: issues in task design and the group oral. Language Testing 13.1, 23 – 51.
Annotation: Building on REVES (1980) and others, this study compared a group oral (3 participants) and two interview-type tasks. Discourse was more varied in the group task, and participants reported a preference for working in a group with other test-takers.

1996 (Themes: A, B, C, D, H)
Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing 13.2, 208 – 238.
Annotation: Based on work conducted since FULCHER (1987), primarily an unpublished PhD project, this paper describes the research underpinning the design of data-based rating scales. The methodology employs discourse analysis of speech samples to produce scale descriptors. The use of the resulting scale is compared with generic a-priori scales. Using discriminant analysis the data-based scores are found to be more reliable, and using MFRM rater variation is significantly decreased. The data-based approach therefore solves the problems identified by researchers like LUMLEY & MCNAMARA (1995). The study also generated the Fluency Rating Scale descriptors, which were used as anchor items in the CEFR project.

1996 (Themes: B, H, J)
Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: The case of CASE. Language Testing 13.2, 151 – 172.
Annotation: In the ROSS & BERWICK (1992) tradition, and inspired by VAN LIER, Lazaraton identifies 8 kinds of support provided by a rater/interlocutor in an OPI. She concludes that the variation is problematic, and calls for additional rater training and possibly the use of an ‘interlocutor support scale’ as part of the rating procedure.

1996 (Themes: B, K)
Pollitt, A. & Murray, N. L. (1996). What raters really pay attention to. In Milanovic, M. & Saville, N. (eds.), Performance testing, cognition and assessment. Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: Cambridge University Press.
Annotation: The use of Thurstone’s Paired Comparisons, and Kelly’s Repertory Grid Technique, to investigate how raters use rating scales and what they notice in candidate spoken performances. The research showed that raters bring their own conceptual baggage to the rating process, but use constructs such as discourse, sociolinguistic, and grammatical competence, as well as fluency and ‘naturalness’.

1997 (Themes: B)
McNamara, T. (1997). ‘Interaction’ in second language performance assessment: Whose performance? Applied Linguistics 18.4, 446 – 465.
Annotation: Speaking had generally been characterized in cognitive terms as traits resident in the speaker being assessed. Building on the work of KRAMSCH (1986) and others, McNamara showed that interaction implied the co-construction of speech, and argued that in social contexts there was shared responsibility for performance. The question of shared responsibility, and the role of the interlocutor, became active areas of research.

1998 (Themes: B, C, H)
Young, R. & He, A. W. (eds.) (1998). Talking and testing: Discourse approaches to the assessment of oral proficiency. Amsterdam: John Benjamins.
Annotation: An important collection of research papers analysing the discourse of test-taker speech in speaking tests. The speaking test is characterized as an ‘interactive practice’ co-constructed by the participants.

1998 (Themes: A, I)
North, B. & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing 15.2, 217 – 262.
Annotation: This paper describes the measurement-driven approach to scale development as embodied in the CEFR. Descriptors from existing speaking scales are extracted from context and scaled with MFRM, using teacher judgments as data.

1999 (Themes: B, K)
Jacoby, S. & McNamara, T. (1999). Locating competence. English for Specific Purposes 18.3, 213 – 241.
Annotation: In two studies, Jacoby & McNamara discovered that the linguistic criteria used by applied linguists to rate speaking performance did not capture the kind of communication valued by subject specialists. They recommended studying ‘indigenous criteria’ to expand what is valued in performances. This work has impacted on domain specific studies, such as FULCHER, DAVIDSON & KEMP (2011). It also raises serious questions about psycholinguistic approaches such as those advocated by VAN MOERE (2012).

2002 (Themes: B, C, H)
Young, R. (2002). Discourse approaches to oral language assessment. Annual Review of Applied Linguistics 22, 243 – 262.
Annotation: A careful investigation of the ‘layers’ of discourse in naturally occurring speech and test tasks. This is combined with a review of various approaches to testing speaking, with an indication of which test formats are likely to elicit the most useful speech samples for rating.

2002 (Themes: B, H)
O’Sullivan, B., Weir, C. J. & Saville, N. (2002). Using observation checklists to validate speaking-test tasks. Language Testing 19.1, 33 – 56.
Annotation: A methodological study to compare the ‘informational and interactional functions’ produced on speaking test tasks with those the test designer intended to elicit. The instrument proved to be unwieldy and impractical, but the study established the important principle for examination boards that evidence of congruence between intention and reality is an important aspect of construct validation.

2003 (Themes: B, H, I, J)
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing 20.1, 1 – 25.
Annotation: A much quoted study into variation in the speech of the same test taker with two different interlocutors. Brown also demonstrated that scores varied, although not by as much as one may have expected. Builds on ROSS & BERWICK (1992), LAZARATON (1996) and MCNAMARA (1996). Raises the critical issue of whether variation should be allowed because it is part of the construct, or controlled because it leads to inequality of opportunity.

2003 (Themes: B, C, H)
Fulcher, G. & Marquez-Reiter, R. (2003). Task difficulty in speaking tests. Language Testing 20.3, 321 – 344.
Annotation: An investigation into the effects of task features (social power and level of imposition) and L1 cultural background on task difficulty and score variation. As in BROWN (2003), it was discovered that although significant variation occurred when extreme conditions were used, effect sizes were not substantial.

2003 (Themes: B, E, I)
Bonk, W. J. & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing 20.1, 89 – 110.
Annotation: Using FACETS, the researchers investigated variability due to test taker, prompt, rater, and rating categories. Test taker ability was the largest facet. Although there was evidence of rater variability, this did not threaten validity, and indicated that raters became more stable in their judgments over time. This adds to the evidence that socialization over time has an impact on rater behaviour.

2005 (Themes: B, C, K)
Cumming, A., Grant, L., Mulcahy-Ernt, P. & Powers, D. E. (2005). A teacher-verification study of speaking and writing prototype tasks for a new TOEFL test. TOEFL Monograph No. MS-26. Princeton, NJ: Educational Testing Service.
Annotation: An important prototyping study. Pre-operational tasks were shown to experts who judged whether they represented the kinds of tasks that students would undertake at university. The experts were also presented with their own students’ responses to the tasks and asked whether these were ‘typical’ of their work. The study shows that test development is a research-led activity, and not merely a technical task. Design decisions and the evidence for those decisions are part of a validation narrative.

2007 (Themes: B, C, L)
Berry, V. (2007). Personality differences and oral test performance. Frankfurt: Peter Lang.
Annotation: Based on many years of research into personality and speaking test performance, Berry shows that levels of introversion and extroversion impact on contributions to conversation in paired and group formats, and result in differential score levels when ability is controlled for.

2008 (Themes: B, C, H)
Galaczi, E. D. (2008). Peer-peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly 5.2, 89 – 119.
Annotation: A discourse analytic study of the paired test format. The research identified three interactive patterns in the data: ‘collaborative’, ‘parallel’ and ‘asymmetric’. Tentative evidence is also presented to suggest that there is a relationship between these interaction patterns and scores on an ‘Interactive Communication’ rating scale.

2009 (Themes: B, C, L)
Ockey, G. (2009). The effects of group members’ personalities on a test taker’s L2 group oral discussion test scores. Language Testing 26.2, 161 – 186.
Annotation: Building on BERRY (2007), Ockey investigates the effect of levels of ‘assertiveness’ on speaking scores in a group oral test, using MANCOVA analyses. Assertive students are found to have lower scores when placed in all assertive groups, and higher scores when placed with less assertive participants. The scores of non-assertive students did not change depending on group makeup. The results differ from BERRY, indicating that much more research is needed in this area.

2010 (Themes: A, B, H, K)
Poonpon, K. (2010). Expanding a second language speaking rating scale for instructional assessment purposes. Spaan Fellow Working Papers in Second or Foreign Language Assessment 8, 69 – 94.
Annotation: A study that brings together the EBB approach of UPSHUR & TURNER with the data-based approach of FULCHER (1996) to create a rich data-based EBB for use with TOEFL iBT tasks. In the process the nature of the academic speaking construct is further explored and defined.

2011 (Themes: A, B, H)
Fulcher, G., Davidson, F. & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance Decision Trees. Language Testing 28.1, 5 – 29.
Annotation: Like POONPON (2010), this study brings together UPSHUR & TURNER’S (1995) EBB and FULCHER’S (1996) data-based approach in the context of service encounters. It also incorporates indigenous insights following JACOBY & MCNAMARA (1999). It describes interaction in service encounters through a performance decision tree that focuses rater attention on observable criteria related to discourse and pragmatic constructs.

2011 (Themes: A, B, C)
Frost, K., Elder, C. & Wigglesworth, G. (2011). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers’ oral performances. Language Testing 29.3, 345 – 369.
Annotation: Integrated task types have become widely used since their incorporation into TOEFL iBT. Yet, little research has been carried out into the use of source material in spoken responses, or how the integrated skill can be described in rating scale descriptors. The ‘integration’ remains elusive. In this study a discourse approach is adopted following ideas in DOUGLAS & SELINKER (1992) and FULCHER (1996) to define content related aspects of validity in integrated task types. The study provides evidence for the usefulness of integrated tasks in broadening construct definition.

2011 (Themes: B, C, K)
May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language Assessment Quarterly 8.2, 127 – 145.
Annotation: Following KRAMSCH (1986), MCNAMARA (1997) and YOUNG (2002), May problematizes the notion of the speaking construct in a paired speaking test. However, she attempts to deal with the problem of how to award scores to individuals by looking at how raters focus on features of the speech of individual participants. The three categories of interpretation (understanding the interlocutor’s message, responding appropriately, and using communicative strategies) are not as important as the attempt to disentangle the individual from the event, while recognizing that discourse is co-constructed.

2011 (Themes: B, H)
Nakatsuhara, F. (2011). Effects of test-taker characteristics and the number of participants in group oral tests. Language Testing 28.4, 483 – 508.
Annotation: Building on BONK & OCKEY (2003) and other research into the group speaking test, Nakatsuhara used conversation analysis to investigate group size in relation to proficiency level and personality type. She discovered that more proficient extroverts talked more and initiated topic more when in groups of 4 than in groups of 3. However, proficiency level resulted in more variation in groups of 3. With reference to GALACZI (2008), she concludes that groups of 3 are more collaborative.

2012 (Themes: B, C)
Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing 29.3, 325 – 344.
Annotation: Very much against the trend, Van Moere makes a case for a return to assessing psycholinguistic speech ‘facilitators’, related to processing automaticity. These include response latency, speed of speech, length of pauses, and reproduction of syntactically accurate sequences, with appropriate pronunciation, intonation and stress. Task types are sentence repetition and sentence building. This approach is driven by an a-priori decision to use an automated scoring engine to rate speech samples, and the validation argument points to the objective nature of the decisions made in comparison with interactive human scored tests, which are claimed to be unreliable and contain too much construct-irrelevant variance. This is an exercise in reductionism par excellence, and is likely to reignite the debate on prediction to domain performance from ‘atomistic’ features that last raged in the early communicative language testing era.

2012 (Themes: E, J)
Tan, J., Mak, B. & Zhou, P. (2012). Confidence scoring of speaking performance: How does fuzziness become exact? Language Testing 29.1, 43 – 65.
Annotation: The application of fuzzy logic to our understanding of how raters score performances. This approach takes into account both rater decisions, and the levels of uncertainty in arriving at those decisions.

2014 (Themes: C, H)
Nitta, R. & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning effects on paired oral performance. Language Testing 31.2, 147 – 175.
Annotation: This research investigates providing test-takers with planning time prior to undertaking a paired speaking test. The unexpected findings are that planning time results in stilted prepared output, and reduced interaction between speakers.

Acknowledgements
I would like to thank Dr. Gary Ockey of Educational Testing Service for reviewing my first draft, and
providing valuable critical feedback. My thanks are also due to the three reviewers, whose very constructive criticism has considerably improved the coverage and coherence of the timeline. Finally, I thank the editor of Language Teaching for timely guidance and advice.