Course Code: PROF EDU 6
Course Title: ASSESSMENT IN LEARNING 1
College: College of Teacher Education
Instructor: JESSA DOTIMAS
Title of the learning resource: ASSESSMENT IN LEARNING 1
Author: ROSITA DE GUZMAN-SANTOS, Ph.D.
TOPIC OUTLINE
Chapter 9: Improving a Classroom-Based Assessment Test
9.1 Overview
9.2 Judgemental Item-Improvement
> Teacher’s Own Review
> Peer Review
> Student Review
9.3 Empirically-Based Procedures
> Difficulty Index
> Discrimination Index
> Distracter Analysis
Chapter 10: Utilization of Assessment Data
10.1 Overview
10.2 Types of Test Scores
> Raw and Percentage Scores
> Percentile Rank
> Standard Scores
> Normal Curve Equivalent
> Developmental Scores
10.3 Types of Score Interpretations
> Norm-referenced Interpretations
> Criterion-referenced Interpretations
Chapter 11: Grading and Reporting of Assessment Results
11.1 Grading
11.2 Analysis of Assessment Data
11.3 Interpretation and Communication of Grades
11.4 Grading System
11.5 Reporting
11.6 Principles of Effective Grading and Reporting
Chapter 9: Improving a Classroom-Based Assessment Test
________________________________
OVERVIEW
By the time you reach Chapter 9, it is assumed that you already know how to plan
a classroom test by specifying the purpose for constructing it and the instructional
outcomes to be assessed, and by preparing a test blueprint to guide the construction
process. The techniques and strategies for selecting and constructing different item
formats to match the intended instructional outcomes make up the second phase of the
test development process, which is the content of the preceding chapter. The process,
however, is not complete without ensuring that the classroom instrument is valid for the
purpose for which it is intended. Ensuring validity requires reviewing and improving the
items, which is the third phase of the process.
This chapter therefore deals with providing practical and necessary ways for
improving teacher-developed assessment tools. Popham (2011) suggests two
approaches to undertake item improvement: the judgemental approach and the
empirical approach.
OUTCOME
At the end of Chapter 9, students are expected to:
a. acquire procedures for improving a classroom-based assessment test
LESSON PROPER: Getting started (Pre-assessment, activating prior knowledge, and/or review), Discussion, activities/tasks, assessment

JUDGEMENTAL ITEM-IMPROVEMENT
This approach basically makes use of human judgement in reviewing the items.
The judges are the teachers themselves, who know exactly what the test is for, the
instructional outcomes to be assessed, and the level of item difficulty appropriate to
their class; the teachers’ peers or colleagues, who are familiar with the curriculum
standards for the target grade level, the subject matter content, and the ability of the
learners; and the students themselves, who can perceive difficulties based on their past
experiences.
Teacher’s Own Review
It is always advisable for teachers to take a second look at the assessment tool
s/he has devised for a specific purpose. To presume perfection right away after its
construction may lead to failure to detect shortcomings of the test or assessment task.
There are five suggestions given by Popham (2011) for the teachers to follow in
exercising judgement:
1. Adherence to item-specific guidelines and general item-writing
commandments.
The preceding chapter has provided specific guidelines in writing various forms
of objective and non-objective constructed-response types and the selected-response
types for measuring lower level and higher-level thinking skills. These guidelines
should be used by the teachers to check how well the items have been planned and
written, particularly in their alignment with the intended instructional outcomes.
2. Contribution to score-based inference.
The teacher examines whether the expected scores generated by the test can
contribute to making valid inferences about the learners. Can the scores reveal the
amount of learning achieved or show what has been mastered? Can the scores indicate
the student’s capability to move on to the next instructional level? Or do the scores
obtained make no difference at all in describing or differentiating various abilities?
3. Accuracy of content.
The review should especially be considered when tests have been developed
after a certain period of time. Changes that may occur due to new discoveries or
developments can redefine the test content of a summative test. If this happens, the
items or the key to correction may have to be revisited.
4. Absence of content gaps.
This review criterion is especially useful in strengthening the score-based
inference capability of the test. If the current tool misses out on important content now
prescribed by a new curriculum standard, the score will likely not give an accurate
description of what is expected to be assessed. The teacher always ensures that the
assessment tool matches what is currently required to be learned. This is a way to
check on the content validity of the test.
5. Fairness.
The discussions on item-writing guidelines always warn against unintentionally
helping uninformed students obtain higher scores. This happens through inadvertent
grammatical clues, unattractive distracters, ambiguous problems, and messy test
instructions. Sometimes unfairness can also arise because of undue advantage enjoyed
by a particular socio-economic group. Getting rid of faulty and biased items and writing
clear instructions definitely add to the fairness of the test.
Peer Review
There are schools that encourage peer or collegial review of assessment instruments.
Time is provided for this activity, and it has almost always yielded good results for
improving tests and performance-based assessment tasks. During these teacher dyad
or triad sessions, those teaching the same subject areas can openly review together
the classroom tests and tasks they have devised against some consensual criteria.
Student Review
Engagement of students in reviewing items has become a laudable practice for
improving classroom tests. The judgement is based on the students’ experience in
taking the test, their impressions and reactions during the testing event. The
process can be efficiently carried out through the use of a review questionnaire. It is
better to conduct the review activity a day after taking the test so the students still
remember the experience when they see a blank copy of the test.
EMPIRICALLY-BASED PROCEDURES
Item improvement using empirically-based methods is aimed at improving the
quality of an item using the students’ responses to the test. Test developers refer to this
technical process as item analysis, as it utilizes data obtained separately for each item.
An item is considered good when its quality indices, i.e., difficulty index and
discrimination index, meet certain criteria. For a norm-referenced test, these two
indices are related since the level of difficulty of an item contributes to its
discriminability. An item is good if it can discriminate between those who perform well
in the test and those who do not. However, an extremely easy item, one that can be
answered correctly by more than 85% of the group, or an extremely difficult item, one
that can be answered correctly by only 15% of the group, is not expected to perform
well as a “discriminator”. The group will appear to be quite homogeneous on items of
this kind; they are weak items since they do not contribute to “score-based inference.”
Difficulty Index
An item’s difficulty index is obtained by calculating the p value (p) which is the
proportion of students answering the item correctly.
p = R/T
where:
p = difficulty index
R = total number of students answering the item correctly
T = total number of students answering the item
Here are two illustrative samples:
Item 1: There were 45 students in the class who responded to item 1, and 30
answered it correctly.
p = 30/45 = 0.67
Item 1 has a p value of 0.67. Sixty-seven percent (67%) got the item right while 33%
missed it.
Item 2: In the same class, only 10 responded correctly to item 2.
p = 10/45 = 0.22
Item 2 has a p value of 0.22. Out of 45, only 10 or 22% got the item right while 35 or
78% missed it.
For a norm-referenced test: Between the two items, item 2 appears to be the much
more difficult item since less than a fourth of the class was able to respond correctly.
For a criterion-referenced test: The class shows much better performance in item 1
than in item 2. Many students still have a long way to go to master the content
assessed by item 2.
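To make the computation concrete, here is a minimal Python sketch of the difficulty
index for the two illustrative items; the function name and layout are illustrative,
not part of the text:

def difficulty_index(num_correct, num_examinees):
    # p = R/T: proportion of students answering the item correctly
    return num_correct / num_examinees

# Item 1: 45 students responded, 30 answered correctly
print(round(difficulty_index(30, 45), 2))   # 0.67
# Item 2: 45 students responded, only 10 answered correctly
print(round(difficulty_index(10, 45), 2))   # 0.22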
Discrimination Index
The item-discrimination index shows the relationship between a student’s
performance on an item (i.e., right or wrong) and his/her total performance in the test
as represented by the total score. Item-total correlation is usually part of a package for
item analysis. High item-total correlations indicate that the items contribute
well to the total score, so that responding correctly to these items gives a better
chance of obtaining a relatively high total score in the whole test or subtest.
For classroom tests, the discrimination index shows if a difference exists
between the performance of those who scored high and those who scored low on an
item. As a general rule, the higher the discrimination index (D), the more marked the
magnitude of the difference, and thus the more discriminating the item. The
nature of the difference, however, can take different directions:
A. Positively discriminating item - the proportion of the high-scoring group
answering correctly is greater than that of the low-scoring group.
B. Negatively discriminating item - the proportion of the high-scoring group
answering correctly is less than that of the low-scoring group.
C. Non-discriminating item - the proportion of the high-scoring group answering
correctly is equal to that of the low-scoring group.
Table 9.2 Guidelines for Evaluating the Discriminating Efficiency of Items

Discrimination Index (D)    Item Evaluation
.40 and above               Very good items
.30 - .39                   Reasonably good items, but possibly subject to improvement
.20 - .29                   Marginal items, usually needing improvement
.19 and below               Poor items, to be rejected or improved by revision
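The text does not spell out a computing formula for D, but a common classroom
procedure (an assumption here, not taken from the text) is to split the examinees into
high- and low-scoring groups and take the difference of the proportions answering
correctly, D = p(upper) - p(lower). A minimal Python sketch, using the evaluation
bands of Table 9.2:

def discrimination_index(correct_upper, n_upper, correct_lower, n_lower):
    # D = proportion correct in the high-scoring group
    #     minus proportion correct in the low-scoring group
    return correct_upper / n_upper - correct_lower / n_lower

def evaluate(d):
    # Classify D using the Table 9.2 guidelines
    if d >= 0.40:
        return "Very good item"
    if d >= 0.30:
        return "Reasonably good item, but possibly subject to improvement"
    if d >= 0.20:
        return "Marginal item, usually needing improvement"
    return "Poor item, to be rejected or improved by revision"

# Hypothetical item: 18 of 20 upper-group students and 8 of 20
# lower-group students answered it correctly.
d = discrimination_index(18, 20, 8, 20)   # D = 0.50
print(evaluate(d))                        # Very good item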
Distracter Analysis
Another empirical procedure for discovering areas for item improvement utilizes an
analysis of the distribution of responses across the distracters. Especially when the
difficulty index and discrimination index of an item suggest that it is a candidate
for revision, distracter analysis becomes a useful follow-up. It can detect
differences in how the more able students respond to the distracters in a multiple-
choice item compared to how the less able ones do. It can also provide an index of
the plausibility of the alternatives, that is, whether they are functioning as good
distracters. Distracters that are not chosen at all, especially by the uninformed
students, need to be revised to increase their attractiveness.
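A distracter analysis can be sketched as a simple tally of option choices per group.
The data below are hypothetical; the point is to expose alternatives that attract no one:

from collections import Counter

def distracter_table(upper_choices, lower_choices, options="ABCD"):
    # Tally how many students in each group chose each option
    up, low = Counter(upper_choices), Counter(lower_choices)
    for opt in options:
        print(f"{opt}: upper={up[opt]:2d}  lower={low[opt]:2d}")

# Hypothetical responses to one item whose key is B
upper = list("BBBBABBBBB")   # high scorers: 9 chose B, 1 chose A
lower = list("BACAABCACA")   # low scorers: spread across A, B, C
distracter_table(upper, lower)
# Option D is never chosen, even by the low scorers, so it needs
# revision to increase its attractiveness.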
Post-Assessment
Activity 1
Below are descriptions of procedures done to review and improve items.
On the space provided, write J if a judgemental approach is used and E if
empirically-based.
__________1. The Math Coordinator of Grade VII classes examined the periodical
tests prepared by the Math teachers to see if their items are aligned with the target
outcomes for the first quarter.
__________2. The alternatives of the multiple-choice items of the Social Studies test
were reviewed to discover if they have only one correct answer.
__________3. To determine if the items efficiently discriminate between the more
able students and the less able ones, a Biology teacher obtained the discrimination
index (D) of the items.
_________4. A Technology Education teacher was interested to see if the criterion-
referenced test he has devised shows a difference in the item’s post-test and pre-test
p-values.
_________5. An English teacher conducted a session with his students to find out if
there are other responses acceptable in their literature test. He encouraged them to
rationalize their answer.
Chapter 10: Utilization of Assessment Data
OVERVIEW
As we have learned in Section 1, tests are forms of assessment. They are
administered to collect data about student learning. Test results can aid in
making informed decisions to improve curriculum and instruction.
Students are interested to know, “What is my score in the test?” In this
chapter, we shall look into the types of test scores. However, the more pressing
question is, “What does the score mean?” Hence, this chapter shall likewise
present test score interpretations using norm-referenced and criterion-referenced
interpretations. You have encountered these in Section 1. We shall now have a
more comprehensive discussion.
OUTCOME
At the end of Chapter 10, students are expected to:
Provide meaning to test results using norm-referenced and criterion-referenced
interpretation.
TYPES OF TEST SCORES
Results of tests are in the form of scores, and these may be raw scores,
percentile ranks, z-scores, T-scores, stanines, or level, category, or proficiency scores
(Harris, 2003).
Raw and Percentage Scores
Raw scores are obtained by simply counting the number of correct responses
in a test following the scoring directions. For instance, a student who gets 30 of 50
items in a Math test correct obviously gets a raw score of 30. This means the
student was able to answer 60% of the items accurately, which is the percentage
score. This percentage score is useful in describing a student’s performance based on
a criterion. However, the problem with raw and percentage scores is that they do not
provide adequate information about student performance. They do not take into
consideration the average score (mean) of those who took the test and how dispersed
the scores are (standard deviation). This is the reason why raw scores are oftentimes
converted to another score type.
Percentile Rank
A percentile rank gives the percent of scores that are at or below a raw or
standard score. It is used to rank students in a reference sample. This should not be
confused with the percentage of correct answers. A score of 35, or a standard score of
1, falling in the 84th percentile means the student scored as well as or better than
84% of the students in the sample.
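A percentile rank as defined above can be computed directly as the percent of scores
at or below a given score. A minimal sketch with a hypothetical reference sample:

def percentile_rank(score, scores):
    # Percent of scores in the reference sample at or below the given score
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

sample = [20, 22, 25, 28, 30, 31, 33, 35, 38, 40]   # hypothetical sample
print(percentile_rank(35, sample))   # 80.0 -> the 80th percentile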
Standard Scores
A. Z-Score
The z-score gives the number of standard deviations a test score lies above or
below the mean. The formula is z = (X - X̄)/s, where X is the test score, X̄ is the
average score, and s is the standard deviation. A negative z-score means the score is
below the average, while a positive z-score means it is above the average.
B. T-Score
Some teachers are not comfortable with using z-scores because of negative
numbers. For example, a score of 12 in a 20-item Science test with an arithmetic
average of 15 and standard deviation of 2 yields a z-score of -1.5. But the student or
his/her parents may not fully comprehend why it is negative. In a T-score scale, the
mean is set to 50 instead of 0, and the standard deviation is 10. To transform a z-
score to a T-score, we multiply the z-score by 10 and add 50, i.e., T=10z + 50.
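The z- and T-score conversions just described can be verified with the Science test
example (score 12, mean 15, standard deviation 2). A minimal sketch:

def z_score(x, mean, sd):
    # Number of standard deviations the score lies above or below the mean
    return (x - mean) / sd

def t_score(z):
    # T = 10z + 50: mean rescaled to 50, standard deviation to 10
    return 10 * z + 50

z = z_score(12, 15, 2)
print(z, t_score(z))   # -1.5 35.0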
C. Stanine
Stanine, short for standard nine, is a method of scaling scores on a nine-point
scale. A raw score is converted to a whole number from a low of 1 to a high of 9. Unlike
the z-score, where the mean and standard deviation are 0 and 1, respectively, stanines
have a mean of 5 and a standard deviation of 2. Stanine scores of 1, 2 and 3 are
below average; 4, 5 and 6 are average; 7, 8 and 9 are above average.
D. Normal Curve Equivalent
The Normal Curve Equivalent (NCE) is a normalized standard score within the
range 1-99. It has a mean of 50 and a standard deviation of 21.06.
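Both stanines and NCEs can be derived from a z-score. The sketch below uses a
common approximation for stanines, round(2z + 5) clipped to 1-9 (exact stanine
assignment is usually done from percentile bands, so treat this as an approximation),
and the NCE relation implied by its mean and standard deviation, NCE = 21.06z + 50:

def stanine(z):
    # Approximate stanine: mean 5, standard deviation 2, clipped to 1-9
    return min(9, max(1, round(2 * z + 5)))

def nce(z):
    # Normal Curve Equivalent: mean 50, standard deviation 21.06
    return 21.06 * z + 50

for z in (-1.5, 0.0, 1.0):
    print(z, stanine(z), round(nce(z), 1))
# -1.5 -> stanine 2, NCE 18.4; 0.0 -> 5, 50.0; 1.0 -> 7, 71.1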
E. Developmental Scores
A grade equivalent (GE) describes a learner’s developmental growth. It gives a
picture as to where he/she is on the achievement continuum. For example, if a second
grader obtained a grade equivalent of 4.2 in language mechanics, it means his/her
score is typical of a fourth grader two months into instruction. Note that a GE is an
estimate of a learner’s location on the developmental continuum and not the grade
level where he/she should be placed.
TYPES OF TEST SCORE INTERPRETATIONS
A frame of reference is some well-defined performance domain (Mehrens and
Lehmann, 1985). This is needed to make sense of test scores. Scores and marks
may be explained in relation to a norm or a criterion. These references were conceived
and differentiated by American psychologist Robert Glaser in 1963. The use of norm-
and criterion-referenced measures hinges on the purpose of assessment. For
example, when assessing driving skills, the student driver’s performance is checked
against a set of criteria. Rendering judgement on whether he/she passes or fails is
criterion-referenced. In selecting top students who will participate in an inter-school
science quiz bee, the relative standing of a student in a science test is necessary.
Selection is made based on norm-referenced scores.
A. Norm-referenced Interpretations
The term “norm” originated from the Latin word norma, which means precept or
rule. By definition, it pertains to the average score in a test. Apart from the school
average norm, there are other types of norms that can be reported: international,
national and local norm groups, and special norm groups (e.g., students who are
visually impaired).
Norm-referenced interpretations are explanations of a learner’s performance in
comparison with other learners of the same age or grade.
A norm-referenced framework uses percentile ranks, standard scores, and
stanines.
Five guidelines when interpreting norm-referenced test scores
1. Detect any unexpected pattern of scores.
2. Determine the reasons for score patterns.
3. Do not expect surprises for every student.
4. Small differences in subtest scores should be viewed as chance fluctuations.
5. Use information from various assessments and observations to explain
performance.
B. Criterion-referenced Interpretations
The word “criterion” came from the Greek word kriterion, which means
standard. And so, criterion-referenced interpretations provide meaning to test scores
by describing what the learner can and cannot do in light of a standard.
Criterion-referenced scores include percentage correct, speed of performance,
quality ratings, and precision of performance.
Criterion-referencing is used in diagnosing students’ needs and monitoring their
progress. It is likewise used in certification and program evaluation. It is the preferred
mode of assessment in an outcome-based education framework.
Post-Assessment
Activity 2
Indicate whether each of the following allows a Criterion-referenced
interpretation (CR), a Norm-referenced (NR) interpretation or neither. Justify your
answer.
1. A test was administered to determine if students can demonstrate adequate
knowledge about fractions. They must answer at least 75% of the items correctly.
Beatriz earned an 80% on the test.
_________________________________________________________________.
2. In a Math test, Amy scored in the 90th percentile. This connotes that Amy’s score
is greater than 90% of the scores of students who took the Math test.
_________________________________________________________________.
3. Shiela, a kindergartener, can select the specified number of objects when verbally
given the numbers from 1 to 10. She was able to answer the teacher’s questions
correctly to achieve an advanced level of proficiency.
_________________________________________________________________.
4. John’s T-score in a Science test is 40. This means he is one standard deviation
below the mean of the group.
_________________________________________________________________.
5. Christian received a stanine of 3 on a reading subtest of a standardized test. This
means that his raw score was in the lower 23% of the scores in a group.
_________________________________________________________________.
6. Purita took a spelling test. With a point awarded to each correct answer, Purita
garnered a score of 20.
_________________________________________________________________.
Chapter 11: Grading and Reporting of Assessment Results
OVERVIEW
It is easy to confuse assessment with grading, but the two are different.
One difference is that assessment centers on the learner. Assessment gathers
information about what the student knows and what he/she can do. Grading is part of
evaluation, as it involves judgement made by the teacher.
In this chapter, we shall look into the grading system in the Philippines: the
weighted grading system and the final rating. The different reporting systems shall also
be discussed. A short segment on progress monitoring is included to provide you with
an idea of how to track student progress through formative assessment.
OUTCOME
At the end of Chapter 11, students are expected to:
Utilize assessment results to monitor students’ learning progress and
achievements.
GRADING
Grading is a process of assigning a numerical value, letter or symbol to
represent student knowledge and performance. These are called grades,
etymologically a “degree of measurement”. According to Guskey (2004), grading
serves six roles: 1) to communicate the achievement status of students to parents and
other stakeholders; 2) to provide information to students for self-evaluation; 3) to
select, identify or sort students for specific programs; 4) to provide incentives for
students to learn; 5) to evaluate the effectiveness of instructional programs; and 6) to
provide evidence of students’ lack of effort or inability to accept responsibility for
inappropriate behavior.
Analysis of Assessment Data
Musial, Nieminen, Thomas and Burke (2009) wrote that a grade has two critical
components: analysis of assessment data, and interpretation and communication of
grades. In analyzing assessment data, teachers must make sure that grades reflect
student learning and performance. Assessment data may become distorted for other
reasons. Providing incentives for perfect attendance, extracurricular activities and
good conduct in class may inflate a learner’s grade. A problem arises when students’
scores in national standardized tests do not compare with their classroom
performance. But a problem also occurs when teachers do not recognize students’
efforts and class participation.
Interpretation and Communication of Grades
The main purpose of grading is to communicate information to students and
parents about student achievement or goal attainment (Nitko and Brookhart, 2011).
Parents have the right to know about the goal attainment of their children. Teachers
must be able to effectively communicate this. However, teachers and parents are not
always in sync in their understanding of the meaning of grades.
Grading Systems
A. Types of Comparisons
Thus far, you have learned that a student’s performance is interpreted in
relation to the performance of other students (norm) or established standards
(criterion). In grading, the same references are used. Norm-referenced grading
focuses on the performance of one’s peers, while criterion-referenced grading focuses
on defined learning targets (Nitko and Brookhart, 2011). Norm-referenced grading is
grading with relative standards. There is a predetermined grade distribution.
B. Approaches to Grading
1. Numerical Grades (100, 99, 98…). The system of using numerical grades is popular
and well understood. They are preferred because they conveniently summarize
overall student performance. Averaging is possible. However, some teachers may still
find it difficult to give meaning to numerical grades, especially to differences between
values (e.g., 75 and 76). Interpretations may vary among teachers, subjects and
schools.
2. Letter Grades (A, B, C, etc.). Letter grades have the same advantages and
disadvantages as numerical grades. They appear to be less intimidating compared to
numerical grades. However, they lack granularity because the codes do not
differentiate two numerical grades that have the same letter symbol. Letter grades
are typically used in American grade schools and high schools.
3. Two-category Grades (Pass-Fail; Satisfactory-Unsatisfactory). This is less
stressful to students because they need not worry about low grade point averages.
However, some students may tend to aspire for just a minimum level of competence
to pass the subject/course. Another drawback is the lack of information this system
provides regarding students’ strong and weak points.
4. Checklist. A checklist may be simple or elaborate. In a dichotomous checklist, the
teacher simply places a check mark next to observed performance statements. In a
checklist of objectives with scales of performance, ratings of student performance are
made indicating the extent of attainment of the learning objectives or outcomes.
Checklists are common at the elementary level and can very well replace or
supplement traditional grading and reporting systems.
5. Standards-based (Advanced, Proficient, …, Beginning or similar). Standards-
based grading requires teachers to base grades on definite learning standards, and
compels them to distinguish product, process and progress criteria in assigning grades
(Guskey, Swan, and Jung, 2011). Standards are broader and at a higher level than
“criteria of a performance” (McMillan, 2007).
REPORTING
A report card is a common method of reporting a learner’s abilities and
progress. It contains a learner’s numerical or letter grades plus other relevant
information. It is periodically submitted by a school to the parents. Note that grade
books and report cards may vary in format.
PRINCIPLES OF EFFECTIVE GRADING AND REPORTING
1. Grades and reports should be based on clearly specified learning goals and
performance standards.
Grades should reflect achievement of learning standards. They should have
qualitative descriptions of the quality of work students have shown or produced.
2. Evidence used for grading should be valid.
Students should be assessed on what they are taught. Grades should not be
influenced by non-academic criteria like attendance, social behaviors, and attitudes,
among others. These muddle the final grade, reducing the validity of interpretation.
Additionally, a student’s grade should not be based on a one-time assessment that
spells either success or utter disaster for students.
3. Grading should be based on established criteria.
Instead of using a norm-referenced system, grades should mirror how learners
have attained the learning targets. Teachers should work towards ensuring that
students in the class achieve mastery.
4. Not everything should be included in grades.
As mentioned in the previous segment, assessments for formative purposes are
all about feedback. They are documented as evidence of progress. Summative
assessments are the accountability measures that determine what students can or
cannot do after an instructional unit.
5. Avoid grading based on averages.
O’Connor and Wormeli (2011) claimed that averaging grades (or scores)
falsifies grade reports. Deviations in performance that occur during the learning
process should not be included. Suppose a student obtained a low score in the first
test but was able to get a high mark in a second test on the topic. It would be incorrect
to average them out; instead, the second assessment should be chosen as a valid
indicator of mastery.
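A small numeric illustration of this point (the scores are hypothetical): averaging a low
early score with a high later score understates the mastery the student eventually
demonstrated.

first_test, second_test = 60, 95          # hypothetical scores on the same topic
average = (first_test + second_test) / 2  # 77.5 understates final mastery
latest = second_test                      # 95 is the more valid indicator
print(average, latest)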
6. Focus on achievement and report other factors separately.
It was already pointed out that academic elements should be reported
separately from non-academic factors (Allen, 2005; O’Connor and Wormeli, 2011;
Kubiszyn and Borich, 2010; Guskey, 2004). Non-academic factors, reported
separately, can still be used to support student learning.
Post-Assessment
Activity 3
1. Based on the five approaches to grading, as a future teacher, which
one do you think is the best to use in grading your students, and why? Explain your
answer. (5 points)
2. In your own words, explain what norm-referenced grading and criterion-referenced
grading are. (5 points)