Note For Weekend Presentation
Mastery Test: a measure of the extent to which a student has mastered a specific set of objectives or met
minimum requirements set by the teacher or examining agency.
Diagnostic tests: are used to measure a student's strengths and weaknesses, usually to identify
deficiencies in skills or performance. Such tests may also be used to identify learning problems. Most
often, diagnostic tests are designed to provide in-depth measurement that locates the source of a particular
problem. These tests are related to prescriptive tests, which go further and prescribe learning activities
intended to overcome student deficiencies.
Measurement. A broad definition of measurement is often stated as the assignment of numbers to the
characteristics of objects or events according to rules (Wiersma and Jurs, 1985). Measurement
often includes the assignment of a number to express in quantitative terms the degree to which a pupil
possesses a given characteristic. The numerals represent some specific characteristic of the
object or event; however, a numeral itself has no relevance to measurement until it is assigned
quantitative meaning.
Measurements involving length, weight, and volume are commonplace and readily understandable by
most people, since the quantification in such measurement is quite apparent.
However, the measurement of educational attributes, although it involves the same general concepts and
ideas, is not as easily understood. The crucial element is, of course, the rule. For this reason, the rule
and what goes into it require specific attention.
Suppose a student is measured on science achievement through the use of a 20-item test, each item
worth 5 points. The rule is that a correct response to an item receives 5 points. The points are then
totaled for the achievement score. Even if the rule is applicable and produces a score representing
quantification, the test cannot produce measurement relevant to the achievement unless the test items
are appropriate.
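A minimal sketch of this scoring rule in Python (the answer key and the student's responses below are hypothetical; only the 20-item, 5-points-per-item setup comes from the example above):

```python
# Scoring rule from the example: each of 20 items is worth 5 points.
POINTS_PER_ITEM = 5

def score_test(responses, answer_key):
    """Apply the rule: a correct response to an item receives 5 points."""
    correct = sum(1 for given, key in zip(responses, answer_key) if given == key)
    return correct * POINTS_PER_ITEM

# Hypothetical 20-item answer key and one student's responses.
answer_key = ["A", "C", "B", "D"] * 5   # 20 items
responses = ["A", "C", "B", "A"] * 5    # 15 correct, 5 incorrect

print(score_test(responses, answer_key))  # 75 out of a possible 100
```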
Assessment. The term assessment is not always used with consistent meaning. Wiersma and Jurs
(1985), for example, used assessment as synonymous with measurement and defined it as collecting
data in the context of conducting measurement. On the other hand, Payne (1992) defined assessment
as the systematic evaluative appraisal of an individual's ability and performance in a particular
environment or context. It is characterized by the synthesis of a variety of data, such as observations,
stress interviews, performance measures, group discussions, individual and group tasks, peer ratings,
projective techniques, and various kinds of structured tests. Assessment is therefore viewed as being
concerned with the totality of the educational setting, subsuming the terms measurement and
evaluation and being more inclusive than either.
Assessment should be considered separately from evaluation, although the two are related.
Assessment includes such activities as grading, examining, determining achievement in a particular
course, or measuring an individual's attitude about an activity, group, or job. In general, assessment is
the use of various written and oral measures and tests to determine the progress of students toward
reaching the program objectives. To be informative, assessment must be done in a systematic manner,
including ensuring consistency within measures (from one assessment period to the next with the same
instrument) and across measures (similar results achieved with different instruments). Evaluation is the
summarization and presentation of these results for the purpose of judging the overall
effectiveness and worth of the program.
Evaluation. Evaluation is the process of making a value judgment about the worth of a student's
achievement, product, or performance (Nitko, 1996). It is a process that includes measurement and
possibly testing, but it also contains the notion of value judgment. If a teacher administers a science
test to a class and computes the percentage of correct responses, measurement and testing have taken
place. The scores must still be interpreted, which may mean judging them to be excellent, good, fair, or
poor. This process is evaluation, because value judgments are being made. Evaluation is sometimes
based on objective data alone; more commonly, however, it involves a synthesis of information from two or
more sources, such as test scores, values, and impressions. In any event, evaluation does include
making a value judgment, and in education such judgments should be based on objective information. The
following figure shows the relationship between testing, measurement, and evaluation (Wiersma & Jurs,
1985).
When a teacher makes value judgments about pupils' performance, she is doing more than
measuring: she is using measurement data to evaluate. All teachers evaluate pupils. Evaluation takes
place when a teacher determines which students have satisfactorily completed a course and which
have not, when the teacher finds that John can operate the microscope better than anyone in the class,
or when it is decided which students are eligible for participation in interschool competition and which
are not. In any school, evaluation is inescapable.
A student's performance may be compared with the performance of other students (normative
evaluation) as in the case of John above--he can operate a microscope better than anyone else in the
class; or a student's performance may be compared with a predetermined standard (criterion
evaluation) as in the case of determining which students are eligible for interschool competition.
Formative Evaluation: Testing occurs continually alongside learning so that teachers can evaluate the
effectiveness of teaching methods along with the assessment of students' abilities.
Formative: Formative evaluations are initial or intermediate evaluations. Formative evaluation
provides information about the progress of the participant each day throughout a learning unit.
It involves breaking a learning unit down into smaller parts to enable both the
educator and the students to identify the precise aspects of a task or performance that are in error and need
correcting. Formative evaluations should occur throughout the instructional, training, or research
process.
Summative Evaluation
Evaluation which tests students' performance to determine students' final overall assimilation of course
material and/or the overall effectiveness of the instructional method is summative.
Summative assessment generally takes place after a period of instruction and requires making a
judgment about the learning that has occurred (e.g., by grading or scoring a test or paper).
Summative: A final, comprehensive judgment made near the end of an instruction or training
program; for example, a final course grade of A.
Summative Evaluation: Testing is done at the end of the instructional unit. The test score is seen as
the summation of all knowledge learned during a particular subject unit.
Norm-referenced: a level of achievement relative to a clearly defined subgroup, such as all women or men your
age. It means that you report how well a performance compares with that of others (people of the
same age, gender, or class).
Norm-referenced standards are designed to rank order individuals from best to worst and are
usually expressed in percentile ranks.
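As a small illustrative sketch in Python (the norm group is hypothetical, and the half-tie convention used here is only one common way of defining a percentile rank):

```python
# Percentile rank: the percentage of scores in the norm group below a given
# score, counting half of any scores tied with it (one common convention).
def percentile_rank(score, norm_group):
    below = sum(1 for s in norm_group if s < score)
    ties = sum(1 for s in norm_group if s == score)
    return 100 * (below + 0.5 * ties) / len(norm_group)

norm_group = [55, 62, 70, 48, 80, 66, 59, 73]  # hypothetical norm sample
print(percentile_rank(66, norm_group))         # 56.25
```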
Criterion-referenced tests (CRTs) determine "...what test takers can do and what they know, not
how they compare to others" (Anastasi, 1988, p. 102). CRTs report how well students are doing
relative to a pre-determined performance level on a specified set of educational goals or outcomes
included in the school, district, or state curriculum.
Classroom tests are teacher-made tests constructed especially for measuring students' achievement.
Standardized tests are commercially produced tests constructed by professional test-makers.
The major distinction between standardized tests and classroom tests is that in a standardized test
a systematic sampling of performance (that is, students' scores) has been obtained under prescribed
directions of administration. They also differ markedly in terms of their sampling of content,
construction, norms, and purposes.
The Mean ($\bar{X}$)
The mean of a set of scores is the arithmetic average. For ungrouped data it is found by
summing all the scores and dividing the sum by the number of scores. The mean is
obtained by the following formula:

$$\bar{X} = \frac{\sum X}{N}$$

Where: $\bar{X}$ = mean
$\sum X$ = the sum of all scores $X$
$N$ = total number of scores
The mean is the most useful of the three measures of central tendency because many
important statistical procedures are based on it and because it takes account of all of the data in the
distribution. It is also a more reliable measure than either the median or the mode. In
addition to its use in describing a set of data, it is used in inferential statistics to estimate
the population mean.
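A short sketch of the formula in Python (the seven scores are the ungrouped data reused from the median examples below):

```python
# Mean of ungrouped data: sum all the scores and divide by the number of scores.
scores = [19, 23, 6, 18, 3, 21, 12]

mean = sum(scores) / len(scores)  # X-bar = (sum of X) / N
print(round(mean, 2))             # 14.57
```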
The Median
The median is the point on the scale of measurement above and below which 50% of the
scores fall. It is the 50th percentile.
Computing the median for ungrouped data essentially consists of identifying the middle
score. If there is an odd number of scores, the median is the middle score in the
distribution. If the number of scores is even, the median falls between the two middle
scores. Therefore, for ungrouped data we compute the median by taking the
following steps:
Arrange the scores in descending order (highest to lowest).
The median score is the $\frac{(N+1)}{2}$th value for both odd and even numbers of
scores.
Examples: Find the median for
(a) 19, 23, 6, 18, 3, 21, 12
(b) 18, 46, 44, 23, 29, 40, 28, 27
Solution
(a) Arrange the scores in descending order: 23, 21, 19, 18, 12, 6, 3

Median = $\frac{(N+1)}{2}$th value = $\frac{(7+1)}{2}$ = 4th value = 18

(b) Arrange the scores in descending order: 46, 44, 40, 29, 28, 27, 23, 18

Median = $\frac{(8+1)}{2}$ = 4.5th value, which falls halfway between the 4th and 5th values = $\frac{29 + 28}{2}$ = 28.5
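The two steps above translate directly into code; a sketch in Python that reproduces both worked examples:

```python
# Median of ungrouped data: the (N + 1)/2-th value of the ordered scores.
def median(scores):
    ordered = sorted(scores, reverse=True)  # step 1: descending order
    position = (len(ordered) + 1) / 2       # step 2: 1-based middle position
    if position.is_integer():               # odd N: a single middle score
        return ordered[int(position) - 1]
    k = int(position)                       # even N: average the two middle scores
    return (ordered[k - 1] + ordered[k]) / 2

print(median([19, 23, 6, 18, 3, 21, 12]))        # 18
print(median([18, 46, 44, 23, 29, 40, 28, 27]))  # 28.5
```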
The process of computing the median becomes more complex when scores are grouped
into class intervals. The formula to calculate the median for grouped data is:

$$Mdn = ll + \frac{\left(\frac{N}{2} - cf\right) i}{f_i} = ll + \left[\frac{N(0.50) - cf}{f_i}\right] i$$

Where $ll$ = lower exact limit of the interval containing the $N/2$th score
$N$ = total number of scores
$cf$ = cumulative frequency of scores below the interval containing the $N/2$th score
$f_i$ = frequency of scores in the interval containing the $N/2$th score
$i$ = size (width) of the class interval.
Applying the formula to the data in Table 9, the median is:

$$Mdn = 38.5 + \frac{\left(\frac{50}{2} - 22\right) 3}{12} = 38.5 + \frac{(3)(3)}{12} = 38.5 + 0.75 = 39.25$$
The median is a very useful statistic: it can be used for a fairly small distribution with a
few extreme scores, when the distribution is badly skewed, or when there are missing scores.
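A sketch of the grouped-data formula in Python, plugging in the values from the Table 9 example (ll = 38.5, N = 50, cf = 22, fi = 12, i = 3):

```python
# Median for grouped data: Mdn = ll + ((N/2 - cf) / fi) * i
ll = 38.5   # lower exact limit of the interval containing the N/2-th score
N = 50      # total number of scores
cf = 22     # cumulative frequency below that interval
fi = 12     # frequency of scores within that interval
i = 3       # width of the class interval

mdn = ll + ((N / 2 - cf) / fi) * i
print(mdn)  # 39.25
```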
The Mode
The mode is the simplest index of central tendency. It is the most frequent score in the
distribution. In ungrouped data it is determined by inspection or counting rather than by
computation. In grouped data the mode is taken as the midpoint of the class interval that
contains the greatest number of scores.
The Standard Deviation
The standard deviation is defined by the following formulas for a population and for a sample:

$$\sigma = \sqrt{\frac{\sum x^2}{N}} = \sqrt{\frac{SS}{N}} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

$$s = \sqrt{\frac{SS}{N - 1}} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

Where
$x = X - \mu$ = deviation scores
$\bar{X}$ = mean of the sample
$s$ = sample standard deviation
The use of the definition formula of the variance would be a tedious task for a large number of
cases. The following computational formula, derived algebraically from the definition
formula, is used to find the sample standard deviation:

$$s = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{n \sum X^2 - (\sum X)^2}{n(n - 1)}}$$
To illustrate the computation of the standard deviation, we will consider the data in Table 9. The
midpoints of the class intervals serve as $X$. The computation is as follows:

$$s = \sqrt{\frac{n \sum fX^2 - \left(\sum fX\right)^2}{n(n - 1)}} = \sqrt{\frac{50(77{,}099) - (1949)^2}{(50)(49)}} = \sqrt{\frac{3{,}854{,}950 - 3{,}798{,}601}{2450}} = \sqrt{\frac{56{,}349}{2450}} = \sqrt{22.999} = 4.80$$
The standard deviation is a measure of the dispersion of the scores around the mean. The
mean of the data is 38.98. Therefore, one standard deviation above the mean is 38.98 +
4.80 = 43.78, and two standard deviations above the mean is 38.98 + (4.80 × 2) = 48.58.
Similarly, one standard deviation below the mean is 38.98 − 4.80 = 34.18, and two standard
deviations below the mean is 38.98 − (4.80 × 2) = 29.38. This means that almost all scores
in the data lie within ±2 standard deviations of the mean.
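The arithmetic of the computational formula can be checked with a short Python sketch, using the summary values quoted above from Table 9 (n = 50, Σfx = 1949, Σfx² = 77,099):

```python
from math import sqrt

# Computational formula: s = sqrt((n * sum(f*x^2) - (sum(f*x))^2) / (n * (n - 1)))
n = 50
sum_fx = 1949      # sum of frequency-weighted midpoints
sum_fx2 = 77_099   # sum of frequency-weighted squared midpoints

s = sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))
print(round(s, 2))  # 4.8, matching the worked result of 4.80
```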
The first step in determining the nature of the relationship is to graph the data. A graph,
or scatter diagram, of the pairs of scores on two variables (which may be test scores) can
be plotted to give a visual illustration of the relationship between the two variables.
For example, suppose that 20 students took two tests, one in English and the other in
History. The pairs of scores are plotted in the scatter diagram below.
[Figure: scatter diagram of the 20 pairs of English and History test scores (axis ticks 5.00–20.00).]
The computed value of a perfect positive correlation coefficient is +1.00, and the
computed value of a perfect negative correlation coefficient is -1.00. When no
relationship exists between the two variables, the correlation coefficient is 0.00. The
computed value of the correlation coefficient is a function of the slope of the general
pattern of points in the scatterplot and the width of the ellipse that encloses the points. If
the slope is negative, the sign of the correlation coefficient is negative. The narrower the
ellipse, the larger the degree of relationship and the larger the correlation coefficient.
The correlation coefficient, r, can be defined in terms of standard (Z) scores:

$$r = \frac{\sum Z_X Z_Y}{N}$$

Where
$Z_X$ = standard score of X = $\frac{X - \bar{X}}{s_X}$
$Z_Y$ = standard score of Y = $\frac{Y - \bar{Y}}{s_Y}$
$N$ = number of cases or observations
The formula states that all X and Y scores must be converted into standard scores, or Z
scores, and the product of each pair of Z scores computed. These products are then summed,
and the sum is divided by N. More simply, r equals the sum of the cross-products of the two
variables, in standard-score form, divided by N.
For computation from raw scores, an algebraically equivalent formula is used:

$$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}$$
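A sketch of the raw-score formula in Python (the five pairs of English and History scores are hypothetical, since the original 20 pairs are not reproduced here):

```python
from math import sqrt

# Pearson r from raw scores:
# r = (N*SXY - SX*SY) / sqrt([N*SXX - SX^2] * [N*SYY - SY^2])
def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

english = [15, 12, 18, 9, 16]  # hypothetical test scores
history = [14, 10, 17, 8, 15]

print(round(pearson_r(english, history), 3))  # close to +1.00: strong positive relationship
```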
Validity
In addition to face validity, evaluations must have content validity. The format of an evaluation
must conform closely to the course objectives that it seeks to evaluate. If a course objective
states that students will be able to apply theories of practice to case studies, then an evaluation
should provide them with appropriate cases to demonstrate this ability.
Finally, effective methods of evaluation have certain predictive characteristics. A student who
performs well on an evaluation concerning a certain skill might be expected to perform well on
similar evaluations on related skills. Additionally, that student might be expected to score
consistently when evaluated in the future.
When constructing or selecting tests and other evaluation instruments, the most
important question is, to what extent will the interpretation of the scores be
appropriate, meaningful, and useful for the intended application of the results? The
validity of data collected implies that the evaluation has actually been focused on
the subject initially targeted for evaluation. For instance, for the sake of validity,
learners’ written and oral skills cannot be evaluated with the same tests.
Validity refers to:
The appropriateness, meaningfulness, and usefulness of the specific
inferences made from test scores.
The consistency (accuracy) with which the scores measure a particular
cognitive ability of interest.
From these definitions we can see that there are two aspects of validity. These are:
What is measured (refers to abilities to perform observable tasks, or
command of substantive knowledge)
How consistently it is measured (refers to the reliability of the scores)
For example, if a test is to be used to describe students' achievement, we should be
able to interpret the scores as a relevant and representative sample of the achievement
domain to be measured. If the results are to be used as a measure of pupils' reading
comprehension, we should like our interpretation to be based on evidence that the scores
actually reflect reading comprehension and are not distorted by irrelevant factors.
Basically, then, validity is always concerned with the specific use of the results and the
soundness of our proposed interpretations.
When interpreting validity in relation to testing and evaluation, there are certain things to
remember. These are:
1. Validity refers to the appropriateness of the interpretation of the results of a test or
evaluation instrument for a given group of individuals, not to the instrument itself.
2. Validity is a matter of degree; it does not exist on an all-or-none basis. Hence we speak of high
validity, moderate validity, and low validity.
3. Validity is always specific to some particular use or interpretation. No test is valid
for all purposes.
4. Validity is a unitary concept. In the most recent revision of the standards, the
traditional view that there are several different types of validity has been discarded.
Instead, validity is viewed as a unitary concept based on various kinds of evidence.
The test constructor builds a paper-and-pencil test to measure mathematical reasoning.
The mathematical reasoning test would be considered to have construct validity to the
degree that test scores are related to the judgments made from observing behavior
identified by the psychological theory as mathematical reasoning. If the anticipated
relationships are not found, then the construct validity of the inference that the test
measures mathematical reasoning is not supported. Construct validation has commonly
been used in theory building and theory testing.
Reliability
The concept of reliability is closely related to (and often confused with) validity. A reliable method of
evaluation will produce similar results (within certain limitations) for the same student across time and
circumstances. Reliability can be defined as the degree of consistency between two measures of
the same thing. It:
Provides the consistency that makes validity possible.
Indicates how much confidence we can place in our results.
The concept of reliability as applied to testing and evaluation can be clarified by noting the
following general points.
1. Reliability refers to the results (test scores) obtained with an evaluation instrument and not
to the instrument itself.
2. Estimates of reliability always refer to a particular type of consistency. Test scores are not
reliable in general; they are reliable or generalizable over different
periods of time
samples of questions
raters
3. Reliability is a necessary but not sufficient condition for validity. A test that produces
totally inconsistent results cannot possibly provide valid information about the
performance being measured. Low reliability can therefore be expected to restrict the degree of
validity that is obtained, but high reliability does not ensure that a satisfactory degree of
validity will be present.
4. Reliability is primarily statistical. Consistency in the relative standing of students in the group,
or the amount of variation to be expected in an individual's score, is reported
by means of a reliability coefficient or the standard error of measurement (see the sketch below).
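As an illustrative sketch of point 4 (a test-retest design is assumed here; other reliability estimates exist), the reliability coefficient can be computed as the correlation between two administrations of the same test, and the standard error of measurement as $s\sqrt{1 - r}$:

```python
from math import sqrt
from statistics import correlation, stdev  # statistics.correlation needs Python 3.10+

# Hypothetical scores from two administrations of the same test.
first_admin = [55, 62, 70, 48, 80, 66, 59, 73]
second_admin = [57, 60, 72, 50, 78, 68, 58, 75]

r = correlation(first_admin, second_admin)  # test-retest reliability coefficient
sem = stdev(first_admin) * sqrt(1 - r)      # standard error of measurement

print(round(r, 3), round(sem, 2))
```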