1. Language testing and assessment
Definition of test
A test is a sample of an individual’s behaviour/performance on the basis of which inferences
are made about the more general underlying competence of that individual.
Language tests involve any kind of measurement/examination technique which aims at
describing the test taker’s foreign language proficiency, e.g. oral interview, listening
comprehension task, or free composition writing.
Language tests may differ in test method and purpose.
Test types based on the testing method
Paper-and-pencil language tests
assessment of
o a separate component of the language (grammar, vocabulary)
o receptive understanding (reading, listening)
test item: fixed response format (a number of possible responses is presented and the
candidate is required to choose one, e.g. multiple choice)
o correct answer: key; incorrect answers: distractors
o distractors are chosen based on observations of typical errors of learners
not useful in testing the productive skills (except indirectly)
Performance-based tests
skills are assessed in an act of communication
assessment of speaking and writing
the samples are elicited through simulations of real-world tasks in realistic contexts
test taker is assessed by trained raters using an agreed rating process
Test types based on the purpose
Achievement tests
associated with the process of instruction
during or at the end of a course of study
whether and where progress has been made in terms of the goals of learning
should support the teaching to which they relate
possible negative effect on teaching: teaching to the test
may be self-enclosed: they may bear no direct relationship to language use
o successful performance does not necessarily indicate successful achievement
relate to the past: they measure what students have learned as a result of teaching
alternative assessment: rather than teaching and studying for the test, involve students in
assessment and enable them to self-assess their progress
Proficiency tests
relate to the future situation of language use
o without necessary reference to any previous process of teaching
based on a specification of what candidates have to be able to do in a language
criterion: the students’ real-life language use
include performance features, where characteristics of the criterion setting are represented
o e.g.: test of communicative abilities of a health professional: communicating with
patients
e.g. admission to a foreign university or an occupation requiring L2 skills
The criterion:
- criterion: the relevant communicative behaviour in the target situation; the series of
performances subsequent to the test (the target)
- test: a performance representing samples from the criterion
- some teachers question the value of direct testing: how can you test behaviour?
Other limits to testing:
- authenticity: there is an inevitable gap between the test and the criterion
- validity: generalizability; does the test actually measure what it is supposed to measure?
- observer’s paradox: the very act of observing a performance can change that performance
Reliability
Reliability shows how precise the measurement is. The scores obtained should be very similar to
those which would have been obtained by the same students, with the same ability, at a different time.
The reliability coefficient:
quantifies the reliability of a test (a value between 0 and 1)
allows the reliability of different tests to be compared
ideal = 1 → the test would give the same results for a particular set of candidates regardless
of when it was administered
what counts as good can differ for different types of language tests
o a good vocabulary or reading test is between .90 and .99
o auditory comprehension is often .80-.89
o and oral production may be .70-.79
it also depends on the importance of the decisions that are to be taken on the basis of
the test
determining reliability → two sets of scores are needed for comparison
o Test-retest method: get a group of subjects to take the same test twice (problematic:
they are likely to recall items, learning or forgetting might take place between the two
tests, and motivation to take the same test twice is low)
o Alternate forms method: use two different forms of the same test; such forms are often
not available
o Split half method: each subject is given two scores, one for each half of the test → the
scores are used as if the same test had been taken twice → provides a coefficient of
internal consistency (see the sketch below)
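A minimal sketch of the split half method, assuming a made-up score matrix and applying the
standard Spearman-Brown correction; all data and numbers are illustrative, not from any real test.

```python
# Split-half reliability sketch (made-up data).
# Requires Python 3.10+ for statistics.correlation.
from statistics import correlation

# Rows = candidates, columns = items, 1 = correct answer.
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 0, 1],
]

# Give each subject two scores: one per half (odd vs even item positions).
half1 = [sum(row[0::2]) for row in items]
half2 = [sum(row[1::2]) for row in items]

# Treat the two halves as if the same test had been taken twice.
r_half = correlation(half1, half2)

# Spearman-Brown correction: estimate the reliability of the full-length
# test from the correlation between its two halves.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, internal consistency = {r_full:.2f}")
```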
The standard error of measurement and the true score
Classical test theory assumes that each person has a true score: the score that would be obtained
if there were no errors in measurement (if the same test could be taken over and over again
without being affected by circumstances, the scores would still vary, and their average would be
the true score).
Standard error of measurement (SEoM)
based on the reliability coefficient and a measure of the spread of all the scores on the test
o e.g. SEoM = 5 and a candidate scores 56 → his/her true score most probably lies between
51 and 61 (the observed score ± one SEoM)
statements based on what is known about the pattern of scores that would occur if it were
possible to take the test over and over again
Item Response Theory (IRT): estimates how far an individual test taker’s actual score is likely
to diverge from their true score; the estimate is made for each individual, based on their
performance on each of the items on the test
standard error of measurement serves to remind us that in the case of some individuals
there is quite possibly a large discrepancy between actual score and true score
Reliability cannot be estimated directly since that would require one to know the true scores, which
according to classical test theory is impossible.
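A minimal sketch of where a figure like SEoM = 5 can come from. The classical formula
SEoM = SD × √(1 − reliability) is standard; the spread and reliability values below are invented
to match the example above.

```python
# Standard error of measurement sketch (illustrative numbers).
import math

sd = 12.5           # spread (standard deviation) of all scores, assumed
reliability = 0.84  # reliability coefficient, assumed

# Classical test theory: SEoM = SD * sqrt(1 - reliability)
seom = sd * math.sqrt(1 - reliability)   # 12.5 * 0.4 = 5.0

observed = 56
# With normally distributed errors, the true score falls within one SEoM
# of the observed score roughly 68% of the time.
print(f"SEoM = {seom:.0f}")
print(f"true score likely between {observed - seom:.0f} and {observed + seom:.0f}")
```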
Scorer reliability
Ideally, the same scorer should give the same scores regardless of the circumstances, and this
would be the same score as any other scorer would give on any occasion.
Scorer reliability coefficient → quantifies the level of agreement given by the same or
different scorers on different occasions
o scorer reliability coefficient of a multiple choice test: 1 (requires no judgement)
o Interview → a degree of judgement is called for on the part of the scorer, so perfect
consistency is not to be expected
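A minimal sketch of quantifying scorer reliability, assuming made-up band scores that two trained
raters gave the same eight interview candidates; a plain correlation stands in here for the more
elaborate agreement statistics used in operational exams.

```python
# Inter-rater (scorer) reliability sketch (made-up ratings).
# Requires Python 3.10+ for statistics.correlation.
from statistics import correlation

rater_a = [6, 7, 5, 8, 4, 6, 7, 5]  # band scores, 0-9 scale
rater_b = [6, 8, 5, 7, 4, 5, 7, 6]  # same candidates, second rater

# 1.0 would mean perfect agreement (as with machine-scored multiple
# choice); where judgement is involved, somewhat lower values are normal.
r = correlation(rater_a, rater_b)
print(f"scorer reliability coefficient = {r:.2f}")
```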
How to make tests more reliable
Take enough samples of behaviour (the more items a test has, the more reliable the
test will be)
o each additional item should represent a fresh start → it yields additional information
Exclude items which do not discriminate well between weaker and stronger students, as they
contribute little to the reliability of a test (they are either too easy or too difficult; a
sketch of this screening follows the list)
Do not allow too much freedom in answering → depressing effect on reliability
Write unambiguous items
Provide clear and explicit instructions, both in written and oral tasks
Ensure that tests are well laid out and perfectly legible
Make candidates familiar with format and design techniques
Provide uniform and non-distracting conditions of administration
Use items that permit scoring which is as objective as possible (multiple choice, or open-ended
gap-filling questions with one-word answers)
Make comparisons between candidates as direct as possible (e.g. provide 2 compulsory
composition items rather than a choice of 6)
Provide a detailed scoring key
Train scorers
Agree on acceptable responses and appropriate scores at the outset of scoring
Identify candidates by number, not by name
Have multiple, independent scorers
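As noted in the list above, a rough way to screen out poorly discriminating items is to correlate
each item with the score on the rest of the test. A minimal sketch, with a made-up score matrix;
the 0.2 cut-off is only a common rule of thumb.

```python
# Item discrimination screen (made-up 0/1 score matrix).
# Requires Python 3.10+ for statistics.correlation.
from statistics import correlation, StatisticsError

scores = [          # rows = candidates, columns = items
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
]

for i in range(len(scores[0])):
    item = [row[i] for row in scores]             # 0/1 on this item
    rest = [sum(row) - row[i] for row in scores]  # total minus this item
    try:
        d = correlation(item, rest)
        flag = "" if d >= 0.2 else "  <- weak, review or exclude"
        print(f"item {i + 1}: discrimination = {d:.2f}{flag}")
    except StatisticsError:
        # No variance: everyone got it right (too easy) or wrong (too
        # difficult), so the item cannot discriminate at all.
        print(f"item {i + 1}: no variance, discriminates nobody")
```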
Reliability and validity
Reliability corresponds to precision (how consistent the measurement is), while validity
corresponds to accuracy (whether the test measures what it is intended to measure).
An example often used to illustrate the difference between reliability and validity in the experimental
sciences involves a common bathroom scale. If someone who weighs 200 pounds steps on a scale 10 times
and gets readings of 15, 250, 95, 140, etc., the scale is not reliable. If the scale consistently reads
"150", then it is reliable but not valid. If it reads "200" each time, the measurement is both
reliable and valid.
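A minimal sketch of the scale analogy in numbers, with made-up readings: the spread of repeated
readings corresponds to reliability (precision), and the distance of their average from the true
weight corresponds to validity (accuracy).

```python
# Reliability vs validity via the bathroom-scale analogy (made-up data).
from statistics import mean, stdev

true_weight = 200
scales = {
    "unreliable":          [15, 250, 95, 140, 185, 210, 60, 230, 175, 120],
    "reliable, not valid": [150, 151, 149, 150, 150, 151, 149, 150, 150, 150],
    "reliable and valid":  [200, 200, 199, 201, 200, 200, 200, 199, 201, 200],
}

for name, readings in scales.items():
    # Low spread = reliable; mean close to the true weight = valid.
    print(f"{name}: mean = {mean(readings):.0f}, "
          f"spread = {stdev(readings):.1f}, true = {true_weight}")
```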