
1. Definition and classification of tests


Test. A test is a narrower concept than measurement, assessment, or evaluation. Test most commonly
refers to a set of items or questions designed to be presented to one or more students under specified
conditions (Wiersma and Jurs, 1985). When a test is given, measurement takes place; however, not all
measurement is testing. Suppose a teacher records information about the learning styles preferred by
the students. This is an example of measurement but is not considered testing.

Achievement test: is a test that measures the extent to which an individual has achieved something -
acquired certain information or mastered certain skills, usually as a result of specific instruction or
general schooling.

Aptitude test: is usually a measure, in the cognitive or psychomotor domain, of the likelihood of an
individual's benefiting from a training program.

Mastery test: is a measure of the extent to which a student has mastered a specific set of objectives or
met minimum requirements set by the teacher or examining agency.

Diagnostic tests: are used to measure students' strengths and weaknesses, usually to identify
deficiencies in skills or performance. Such tests may also be used to identify learning problems. Most
often diagnostic tests are designed to provide in-depth measurement that locates the source of a particular
problem. They are related to prescriptive tests, which go further and prescribe learning activities
intended to overcome student deficiencies.

Measurement. A broad definition of measurement is often stated as the assignment of numbers to the
characteristics of objects or events according to rules (Wiersma and Jurs, 1985). Measurement
often includes the assignment of a number to express in quantitative terms the degree to which a pupil
possesses a given characteristic. The numerals represent some specific characteristic of the
object or event; however, a numeral itself has no relevance to measurement until it is assigned
quantitative meaning.

Measurements involving length, weight, and volume are commonplace and readily understandable by
most people, since the quantification in such measurement is quite apparent.
Measurement of educational attributes, however, although it involves the same general concepts and
ideas, is not as easily understood. The crucial element is, of course, the rule. For this reason the rule,
and what goes into it, require specific attention.

Suppose a student is measured on science achievement through the use of a 20-item test, each item
representing 5 points. The rule is that a correct response to an item receives 5 points. The points are then
totaled for the achievement score. Even if the rule is applicable and produces a score representing
quantification, the test cannot produce measurement relevant to achievement unless the test items
are appropriate.
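As a minimal sketch of such a scoring rule (in Python; the answer key and student responses below are invented for illustration):

def score_test(responses, key, points_per_item=5):
    """Apply the rule: each correct response earns points_per_item points."""
    return sum(points_per_item for r, k in zip(responses, key) if r == k)

key = list("ABCDA" * 4)                          # hypothetical 20-item answer key
responses = list("ABCDA" * 3) + list("BBCDA")    # hypothetical student: 19 items correct
print(score_test(responses, key))                # 95

The rule produces a quantitative score, but as noted above, the score is a relevant measurement of achievement only if the items themselves are appropriate.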

Assessment. The term assessment is not always used with consistent meaning. Wiersma and Jurs
(1985), for example, used assessment as synonymous with measurement and defined it as collecting
data in the context of conducting measurement. On the other hand, Payne (1992) defined assessment
as the systematic evaluative appraisal of an individual's ability and performance in a particular
environment or context. It is characterized by a synthesis of a variety of data such as observations,
stress interviews, performance measures, group discussions, individual and group tasks, peer ratings,
projective techniques, and various kinds of structured tests. Assessment is thus viewed as concerned
with the totality of the educational setting; it subsumes the terms measurement and evaluation and is
more inclusive than either.

Assessment should be considered separately from evaluation, although the two are related.
Assessment includes such activities as grading, examining, determining achievement in a particular
course or measuring an individual attitude about an activity, group, or job. In general, assessment is
the use of various written and oral measures and tests to determine the progress of students toward
reaching the program objectives. To be informative, assessment must be done in a systematic manner,
including ensuring consistency within measures (from one assessment period to the next with the same
instrument) and across measures (similar results achieved with different instruments). Evaluation is the
summarization and presentation of these results for the purpose of judging the overall
effectiveness and worth of the program.

Evaluation. Evaluation is the process of making a value judgment about the worth of a student's
achievement, product, or performance (Nitko, 1996). It is a process that includes measurement and
possibly testing, but it also contains the notion of value judgment. If a teacher administers a science
test to a class and computes the percentage of correct responses, measurement and testing have taken
place. The scores must then be interpreted, which may mean judging them to be excellent, good, fair, or
poor. This process is evaluation, because value judgments are being made. Evaluation is sometimes
based on objective data alone; more commonly, however, it involves a synthesis of information from two or
more sources such as test scores, values, and impressions. In any event, evaluation does include
making a value judgment, and in education such judgments should be based on objective information.
The relationship between test, measurement, and evaluation is illustrated by Wiersma and Jurs (1985).

When a teacher makes value judgments about pupils' performance, then she is doing more than
measuring. She is using measurement data to evaluate. All teachers evaluate pupils. Evaluation takes
place when a teacher determines which students have satisfactorily completed a course and which ones
have not, when the teacher finds that John can operate the microscope better than anyone in the class,
when we decide which students are eligible for participation in interschool competition and which
students are not. In any school, evaluation is inescapable.

A student's performance may be compared with the performance of other students (normative
evaluation) as in the case of John above--he can operate a microscope better than anyone else in the
class; or a student's performance may be compared with a predetermined standard (criterion
evaluation) as in the case of determining which students are eligible for interschool competition.

Formative Evaluation: Testing occurs continually alongside learning so that teachers can evaluate the
effectiveness of teaching methods along with assessing students' abilities. Formative evaluations are
initial or intermediate evaluations; they provide information about the progress of the participant each
day throughout a learning unit. Formative evaluation involves breaking a learning unit down into
smaller parts to enable both the educator and the students to identify the precise parts of a task or
performance that are in error and need correcting. Formative evaluations should occur throughout the
instructional, training, or research process.

Summative Evaluation

Evaluation that tests students' performance to determine their final overall assimilation of course
material and/or the overall effectiveness of the instructional method is summative. Summative
evaluation generally takes place after a period of instruction and requires making a judgment about
the learning that has occurred (e.g., by grading or scoring a test or paper). It is a final, comprehensive
judgment conducted near the end of an instruction or training program; for example, the final grade
of A. Testing is done at the end of the instructional unit, and the test score is seen as the summation of
all knowledge learned during a particular subject unit.

Norm-referenced evaluation (NRT) is evaluation based on a comparison of a student's performance
with one or more other students' performance on the same test. It reports a level of achievement
relative to a clearly defined subgroup, such as all women or men your age; that is, it reports how well
a performance compares with that of others (people of the same age, gender, or class).

Norm-referenced standards are designed to rank order individuals from best to worst and are
usually expressed in percentile ranks.

Criterion-referenced evaluation is evaluation based on a comparison of a student's performance
with some preset performance standard which is determined independently of the test or test
scores. With a criterion-referenced approach, the examiner compares an individual's result against a
specific, predetermined level of achievement (standard) to determine mastery.

Criterion-referenced standards are a minimum proficiency or pass-fail standard.

Criterion-referenced tests (CRTs) determine "...what test takers can do and what they know, not
how they compare to others" (Anastasi, 1988, p. 102). CRTs report how well students are doing
relative to a pre-determined performance level on a specified set of educational goals or outcomes
included in the school, district, or state curriculum.

Item analysis for criterion-referenced mastery tests

The item analysis procedure used with norm-referenced tests is not directly applicable to criterion-
referenced mastery tests. The reasons are:
 CRTs are designed to describe pupils in terms of the types of learning tasks they can
perform
 CRT items measure the effects of instruction, not the ranking of students
 In preparing items for criterion-referenced tests, the item writer need not make a
conscious decision to write items of about moderate difficulty.
 The level of difficulty and the discriminating index of criterion-referenced test items are
determined by the learning outcome they are designed to measure.
The ability of test items to discriminate between high and low achievers is not crucial for criterion-
referenced test items, because some good items might have very low, or zero, indexes of
discrimination. If all students answered a test item correctly, the item would be eliminated from a
norm-referenced test; in a criterion-referenced test this may indicate that both the instruction and
the item have been effective. An item earmarked for revision for one type of test may be selected
without change for use in the other type.
The steps followed in the analysis of criterion-referenced mastery items are listed below and illustrated in the sketch after this list:
 The same test is given before instruction (pretest) and after instruction (posttest) to
determine the extent to which the test items measure the effects of the instruction.
 Prepare a chart by listing the numbers of the items across the top of the chart and the
students' names down the side of the chart.
 Record correct (+) and incorrect (-) responses for each pupil on the pretest (B) and posttest (A).
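A minimal Python sketch of such a chart and its use; the pupil names and response patterns are invented, and the index S = (RA − RB)/T computed in the loop is the commonly used sensitivity-to-instructional-effects summary, not one prescribed by this text:

# Pretest (B) and posttest (A) responses: '+' correct, '-' incorrect,
# one character per item. Names and patterns are invented.
pretest = {
    "Pupil 1": "--+-",
    "Pupil 2": "-+--",
    "Pupil 3": "----",
}
posttest = {
    "Pupil 1": "++++",
    "Pupil 2": "+++-",
    "Pupil 3": "++-+",
}

T = len(pretest)                  # number of pupils who took both tests
for item in range(4):             # four items in this invented chart
    r_b = sum(resp[item] == "+" for resp in pretest.values())   # correct before
    r_a = sum(resp[item] == "+" for resp in posttest.values())  # correct after
    s = (r_a - r_b) / T           # sensitivity to instructional effects
    print(f"Item {item + 1}: pretest {r_b}/{T}, posttest {r_a}/{T}, S = {s:.2f}")

An item answered correctly by few pupils before instruction and by most pupils after it (S near 1) is doing its job, even though its discrimination index would be poor by norm-referenced standards.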
Classroom Vs Standardized Tests

Classroom tests are teacher-made tests constructed especially for measuring students' achievement.

Standardized tests are commercially produced tests constructed by professional test-makers.

The major distinction between standardized tests and classroom tests is that in a standardized test a
systematic sampling of performance (that is, students' scores) has been obtained under prescribed
directions of administration. They also differ markedly in terms of their sampling of content,
construction, norms, and purposes.

Measures of central tendency


Averages or measures of central tendency are descriptive properties of a set of observations or
their corresponding frequency distributions. The average is a central reference value which is
usually close to the point of greatest concentration of the measurements and may in some
sense be thought to typify the whole set. The three commonly used measures of central
tendency are the mean, the median and the mode. Each is described below.

The mean (X̄)

The mean of a set of scores is the arithmetic average. For ungrouped data it is found by
summing all the scores and dividing the sum by the number of scores. The mean is
obtained by the following formula:

X̄ = ΣX / N

Where: X̄ = mean
ΣX = the sum of all scores X
N = total number of scores

The mean of the 50 scores in Table 8 is

X̄ = ΣfX / N = (49 + (47 × 2) + ... + (31 × 2) + 30) / 50 = 1949 / 50 = 38.98

The mean is the most useful of the three measures of central tendency because many
important statistical procedures are based on it and because it is based on all of the data in the
distribution. It is also a more reliable measure than either the median or the mode. In
addition to its use in describing a set of data, it is used in inferential statistics to estimate
the population mean.
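A minimal Python sketch of both computations, using the Table 10 midpoints and frequencies:

def mean_ungrouped(scores):
    # X-bar = sum of X / N
    return sum(scores) / len(scores)

def mean_grouped(midpoints, freqs):
    # X-bar = sum of f*X / N, with class midpoints standing in for the scores
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / n

midpoints = [49, 46, 43, 40, 37, 34, 31]
freqs = [1, 6, 9, 12, 8, 9, 5]
print(mean_grouped(midpoints, freqs))   # 1949 / 50 = 38.98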

The Median
The median is the point on the scale of measurement above and below which 50% of the
scores fall. It is the 50th percentile.

Computing the median for ungrouped data essentially consists of identifying the middle
score. If there is an odd number of scores, the median is the middle score in the
distribution. If the number of scores is even, the median falls between the two middle
scores. Therefore, for ungrouped data we compute the median by taking the
following steps.
 Arrange the scores in descending order (highest to lowest)
 The median is the (N + 1)/2 th value, for both odd and even numbers of scores.
Examples: Find the median for
(a) 19, 23, 6, 18, 3, 21, 12
(b) 18, 46, 44, 23, 29, 40, 28, 27
Solution
 Arrange the scores in descending order and determine the (N + 1)/2 th value for the data in (a) and (b).

a) 23, 21, 19, 18, 12, 6, 3

Median = (N + 1)/2 th value = (7 + 1)/2 th = 4th value = 18

b) 46, 44, 40, 29, 28, 27, 23, 18

Median = (N + 1)/2 th value = (8 + 1)/2 th = 4.5th value = (29 + 28)/2 = 28.5
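A minimal Python sketch of the (N + 1)/2 rule, checked against the two examples above:

import math

def median_ungrouped(scores):
    s = sorted(scores, reverse=True)          # descending, as in the steps above
    pos = (len(s) + 1) / 2                    # 1-based position of the median
    lo, hi = math.floor(pos), math.ceil(pos)  # equal for an odd number of scores
    return (s[lo - 1] + s[hi - 1]) / 2        # averages the two middle scores when even

print(median_ungrouped([19, 23, 6, 18, 3, 21, 12]))        # 18.0
print(median_ungrouped([18, 46, 44, 23, 29, 40, 28, 27]))  # 28.5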

The process of computing the median becomes more complex when scores are grouped
into class intervals. The formula to calculate the median for grouped data is:

Mdn = ll + ((N/2 − cf) / fi) × i

Where ll = lower exact limit of the interval containing the N/2 score
N = total number of scores
cf = cumulative frequency of scores below the interval containing the N/2 score
fi = frequency of scores in the interval containing the N/2 score
i = size (width) of the class interval

Applying the formula for the data in Table 9, the median is:

Mdn = 38.5 + ((50/2 − 22) / 12) × 3 = 38.5 + 0.75 = 39.25

The median is a very useful statistic; it can be used for a fairly small distribution with a
few extreme scores, when the distribution is badly skewed, or when there are missing scores.
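A minimal Python sketch of the grouped-data formula, with the values read from Table 10 (ll = 38.5, cf = 22, fi = 12, i = 3):

def median_grouped(ll, n, cf, fi, i):
    # Mdn = ll + ((N/2 - cf) / fi) * i
    return ll + ((n / 2 - cf) / fi) * i

print(median_grouped(ll=38.5, n=50, cf=22, fi=12, i=3))  # 39.25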

The Mode
The mode is the simplest index of central tendency. It is the most frequent score in the
distribution. In ungrouped data it is determined by inspection or counting rather than by
computation. In grouped data the mode can be estimated by using:

Mode = 3 × median − 2 × mean = 3(39.25) − 2(38.98) = 39.79

The mode is used when there is a need to estimate the central tendency quickly, when
there is a large number of cases, or when the data are nominal or categorical.
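A minimal Python sketch of both approaches (the ungrouped scores are invented for illustration):

from statistics import mode

print(mode([3, 5, 5, 6, 7, 5, 2]))  # 5, the most frequent score

def mode_grouped(median, mean):
    # empirical estimate: mode = 3 * median - 2 * mean
    return 3 * median - 2 * mean

print(mode_grouped(39.25, 38.98))   # 39.79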

5.1.3. Measures of variations

Measures of variability are indicators of the dispersion of the distribution of scores.
Although averages summarize the central tendency of a group of scores, they do not
summarize how the raw scores spread out over the score scale. For example, the
mean of the English test for two 10th grade classes may be 65. However, in one class the
scores may range widely from 25 to 90, while in the other the scores may range from 60
to 70. Obviously, the students in the latter class are more nearly alike in their English
achievement than the students in the former class. You will need to provide for more widely
differing English levels when teaching the former class than when teaching the latter.
This section describes two measures of variability: the variance and its square root, the
standard deviation.

Variance and standard deviation
The variance is the average squared difference between the scores and the mean; the
standard deviation is its square root and measures how widely the scores in the
distribution are spread about the mean.
The standard deviation of a population is symbolically defined as

σ = √(SS/N) = √(Σ(X − μ)² / N) = √(Σx² / N)

Where
μ = mean of the distribution
N = total number of scores in the distribution
SS = sum of squares
X = any raw score
x = any deviation score (X − μ)
σ = population standard deviation
The standard deviation of a sample is symbolically defined as

s = √(SS/(N − 1)) = √(Σ(X − X̄)² / (N − 1))

Where
X̄ = mean of the sample
s = sample standard deviation
Using the definition formula of the variance would be a tedious task for a large number of
cases. The following computational formula, derived algebraically from the definition
formula, is used to find the sample standard deviation:


s = √(SS/(n − 1)) = √((ΣX² − (ΣX)²/n) / (n − 1)) or s = √((nΣX² − (ΣX)²) / (n(n − 1)))

To illustrate the computation of the standard deviation, we will consider the data in Table 9. The
midpoints of the class intervals serve as X. The procedure for computing the values required
in the computation is as follows.

Table 10: Data for the computation of the standard deviation

Class Interval   f    Midpoint (X)   fX     fX²
48-50            1    49             49     2401
45-47            6    46             276    12696
42-44            9    43             387    16641
39-41            12   40             480    19200
36-38            8    37             296    10952
33-35            9    34             306    10404
30-32            5    31             155    4805
Total            50                  1949   77099
From Table 10 the following results are obtained and inserted into the computational
formula to obtain the standard deviation of the data:
ΣfX = 1,949
ΣfX² = 77,099
n = 50


s = √((n(ΣfX²) − (ΣfX)²) / (n(n − 1)))

= √((50(77,099) − (1949)²) / ((50)(49)))

= √((3,854,950 − 3,798,601) / 2,450) = √(56,349 / 2,450) = √22.999 = 4.80

The standard deviation is a measure of the dispersion of the scores around the mean. The
mean of the data is 38.98. Therefore, one standard deviation above the mean is 38.98 +
4.80 = 43.78 and two standard deviations above the mean is 38.98 + (4.80 × 2) = 48.58.
Similarly, one standard deviation below the mean is 38.98 − 4.80 = 34.18 and two standard
deviations below the mean is 38.98 − (4.80 × 2) = 29.38. This means that almost all scores
in the data lie within ± 2 standard deviations of the mean.
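A minimal Python sketch of the computational formula applied to Table 10:

import math

midpoints = [49, 46, 43, 40, 37, 34, 31]
freqs = [1, 6, 9, 12, 8, 9, 5]

n = sum(freqs)                                               # 50
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))        # 1949
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints))  # 77099

s = math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))
print(round(s, 2))                                           # 4.8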

6.1.4. Measures of Correlation

Correlation deals with the extent to which two or more variables are related. The
magnitude of the relationship between the variables is measured by an index called the
correlation coefficient.

The first step in determining the nature of the relationship is to graph the data. A graph,
or scatter diagram, of the pairs of scores on two variables (which may be test scores) can
be plotted to give a visual illustration of the relationship between the two variables.
For example, suppose that 20 students took two tests, one in English and the other in
History. The pairs of scores are presented below.

Table 11: English and History Scores

Student   English   History
1         7         5
2         10        12
3         18        10
4         20        12
5         11        17
6         8         13
7         25        15
8         16        18
9         10        18
10        8         5
11        18        14
12        29        20
13        24        22
14        16        12
15        9         10
16        16        18
17        12        6
18        11        9
19        24        23
20        14        22

The scatterplot of the data is given below.


[Scatterplot: History scores (vertical axis, 5.00-25.00) plotted against English scores (horizontal axis, 5.00-30.00).]
Note that the plotted points tend to run from the lower left to the upper right. This pattern
represents a positive correlation between the two variables. In other words, a positive
relationship is represented when high scores on variable X are associated with high
scores on variable Y. On the other hand, points running from the upper left to the lower
right would represent a negative correlation. A perfect relationship between two
variables exists when all points in the scatterplot lie on a straight line. A scatterplot in
which the points form a nearly circular pattern illustrates zero or near-zero correlation.

The computed value of a perfect positive correlation coefficient is +1.00, and the
computed value of a perfect negative correlation coefficient is -1.00. When no
relationship exists between the two variables, the correlation coefficient is 0.00. The
computed value of the correlation coefficient is a function of the slope of the general
pattern of points in the scatterplot and the width of the ellipse that encloses the points. If
the slope is negative, the sign of the correlation coefficient is negative. The narrower the
ellipse, the larger the degree of relationship and the larger the correlation coefficient.

The Pearson Product-Moment Correlation Coefficient

The most commonly used correlation coefficient in the behavioral sciences is the
Pearson product-moment correlation coefficient, or the Pearson r. The level of
measurement required for using the Pearson r is the interval or ratio scale.
The standard score formula for the Pearson r is

r = Σ(Zx Zy) / N

Where: Zx = standard score (Z) of X = (X − X̄) / sx
Zy = standard score (Z) of Y = (Y − Ȳ) / sy
N = number of cases or observations

The formula states that all X and Y scores must be converted into standard scores (Z
scores) and the product of each pair of Z scores computed. These products are then summed
and the sum divided by N. More simply, r equals the sum of the cross products of the two
variables, in standard score form, divided by N.

Computational Formula for the Correlation Coefficient

The Z-score formula is not generally used to compute the Pearson r, because it requires
converting all observed scores to standard scores. However, by using the definition of
standard scores, it is possible to derive a computational formula that involves only the
observed X and Y scores. The formula is

r = (NΣXY − (ΣX)(ΣY)) / √([NΣX² − (ΣX)²][NΣY² − (ΣY)²])
where N = number of pairs of scores
ΣX = the sum of the X scores
ΣY = the sum of the Y scores
ΣX² = the sum of the squared X scores
ΣY² = the sum of the squared Y scores
ΣXY = the sum of the products of paired X and Y scores
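A minimal Python sketch of this computational formula applied to the Table 11 score pairs:

import math

english = [7, 10, 18, 20, 11, 8, 25, 16, 10, 8,
           18, 29, 24, 16, 9, 16, 12, 11, 24, 14]
history = [5, 12, 10, 12, 17, 13, 15, 18, 18, 5,
           14, 20, 22, 12, 10, 18, 6, 9, 23, 22]

def pearson_r(x, y):
    # r = (N*sum(XY) - sum(X)*sum(Y)) / sqrt([N*sum(X^2) - (sum X)^2][N*sum(Y^2) - (sum Y)^2])
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

print(round(pearson_r(english, history), 2))  # about 0.60 for these data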
The concepts of reliability, validity and usability of tests will be discussed in the
following pages.

Validity

It is of paramount importance that the method of evaluation employed be able to accurately
measure the skill or knowledge that it seeks to measure; that is, that it be valid. It is also important
that evaluations exhibit what is known as face validity. Face validity means that elements of the
evaluation appear to be related to stated course objectives. A common student complaint is that
they cannot perceive the connection between the evaluation and the course objectives. It is
therefore necessary not only that the instructor be able to make a connection between the
evaluation and the course, but that the student be able to do so as well.

In addition to face validity, evaluations must have content validity. The format of an evaluation
must conform closely to the course objectives that it seeks to evaluate. If a course objective
states that students will be able to apply theories of practice to case studies, then an evaluation
should provide them with appropriate cases to demonstrate this ability.

Finally, effective methods of evaluation have certain predictive characteristics. A student who
performs well on an evaluation concerning a certain skill might be expected to perform well on
similar evaluations on related skills. Additionally, that student might be expected to score
consistently when evaluated in the future.

When constructing or selecting tests and other evaluation instruments, the most
important question is, to what extent will the interpretation of the scores be
appropriate, meaningful, and useful for the intended application of the results? The
validity of data collected implies that the evaluation has actually been focused on
the subject initially targeted for evaluation. For instance, for the sake of validity,
learners’ written and oral skills cannot be evaluated with the same tests.
Validity refers to:
 The appropriateness, meaningfulness, and usefulness of the specific
inferences made from test scores.
 The consistency (accuracy) with which the scores measure a particular
cognitive ability of interest.
From these definitions we can see that there are two aspects of validity. These are:
 What is measured (refers to abilities to perform observable tasks, or
command of substantive knowledge)
 How consistently it is measured (refers to the reliability of the scores)
For example, if a test is to be used to describe students' achievement, we should be
able to interpret the scores as a relevant and representative sample of the achievement
domain to be measured. If the results are to be used as a measure of pupils' reading
comprehension, we should like our interpretation to be based on evidence that the scores
actually reflect reading comprehension and are not distorted by irrelevant factors.
Basically, then, validity is always concerned with the specific use of the results and the
soundness of our proposed interpretations.

Reliability is a necessary ingredient of validity, but it is not sufficient to ensure validity.
Unless the test scores measure what the test user intends to measure, no matter how
reliably, the scores will not be valid.

When interpreting validity in relation to testing and evaluation, there are certain things to
remember. These are:
1. Validity refers to the appropriateness of the interpretation of the results of a test or
evaluation instrument for a given group of individuals, not to the instrument itself.
2. Validity is a matter of degree; it does not exist on an all-or-none basis. Hence we speak
of high validity, moderate validity, and low validity.
3. Validity is always specific to some particular use or interpretation. No test is valid
for all purposes.
4. Validity is a unitary concept. In the most recent revision of the standards, the
traditional view that there are several different types of validity has been discarded.
Instead, validity is viewed as a unitary concept based on various kinds of evidence.

Kinds of Validity Evidence

i. Content validity
Content validity is related to how adequately the content of the test samples the
domain about which inferences are to be made. The procedure is to compare the test
tasks to the test specifications describing the task domain under consideration. If the
test specification is carefully constructed and carefully followed in building the test,
it will contribute much to ensuring content validity.
There is no single commonly used numerical expression for content validity. It is
determined by:
 Whether or not each item represents the total domain or sub-domain;
 Critical inspection of the test items to determine whether the items represent the
content;
 Inter-judge agreement about the match of the items to the domain;
 Building two tests over the same content, giving both to the same set of
students, and correlating the results.

ii. Criterion-related validity

Criterion-related validity provides information on how well test performance predicts
future performance or estimates current performance on some valued measure other
than the test itself, called the criterion. The procedure is to compare test scores with
another measure of performance obtained at a later date (for prediction) or with
another measure of performance obtained concurrently (for estimating present
status). For example, scores on a dictation test are generally accepted as measures
of spelling achievement. We can make a distinction between two kinds of criterion-
related validity. These are:
 Concurrent validity: criterion data are collected at approximately the
same time as the test data.
 Predictive validity: criterion data are gathered at a later date. The concern is with the
usefulness of the test scores in predicting some future performance.
Example:
 Do scores on the CEE (college entrance examination) predict freshman program
performance in the university?
 Do supervisors' ratings predict success on the job?
There is a need to show that a positive relationship exists between scores on the CEE
(the predictor) and grade point average in the freshman program (the criterion). If a
correlation of, say, 0.60 is obtained, we might conclude that the examination is a useful
predictor of future performance; that is, there is support for using the scores to predict
success in university. The degree of positive relationship between the predictor and the
criterion can range from no correlation (r = 0.00) through moderate positive correlation
(r = 0.60) to perfect positive correlation (r = 1.00).
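As a hypothetical illustration (the entrance scores and grade point averages below are invented), the predictive validity coefficient is simply the correlation between predictor and criterion (Python 3.10+):

from statistics import correlation

cee_scores = [72, 65, 80, 55, 90, 60, 75, 68]             # invented predictor scores
freshman_gpa = [3.1, 2.8, 3.4, 2.2, 3.8, 2.9, 3.0, 2.7]   # invented criterion values

print(round(correlation(cee_scores, freshman_gpa), 2))    # a strong positive r for these invented data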
One problem in criterion-related validity is the lack of suitable criteria for validating
achievement tests. This makes it difficult for instructors to select satisfactory criteria to
validate their examinations. In that case instructors have to depend on procedures of
logical analysis to ensure valid test interpretation. Accordingly, they are advised to:
 Carefully identify the objectives of instruction
 State the objectives in terms of changes in students’ performance.
 Construct or select evaluation instruments that satisfactorily measure the learning
outcomes sought.

iii. Construct-related validity

When we are interested in interpreting test performance in terms of some psychological trait
or quality, we are concerned with construct-related validity evidence. Construct validity
is the degree to which we can infer certain constructs in a psychological theory from the
test scores. A construct is a psychological quality that we assume exists in order to
explain some aspect of behavior.
Examples:
 Mathematical reasoning ability
 Intelligence
 Creativity
 Reading comprehension
 Sociability
 Honesty
For example, rather than speak about a pupil's score on a particular mathematics test or
how well it predicts grades in future mathematics courses, we might want to infer that
the pupil possesses a certain degree of mathematical reasoning ability.

The test constructor builds a paper-and-pencil test to measure mathematical reasoning.
The mathematical reasoning test would be considered to have construct validity to the
degree that test scores are related to the judgments made from observing behavior
identified by the psychological theory as mathematical reasoning. If the anticipated
relationships are not found, then the construct validity of the inference that the test is
measuring mathematical reasoning is not supported. Construct validation has commonly
been used in theory building and theory testing.

Factors influencing validity

The following factors can prevent test items from functioning as intended and
thereby lower the validity of the interpretations made from the test scores.
1. Unclear directions;
2. Reading vocabulary and sentence structure too difficult;
3. Inappropriate level of difficulty of the test items;
4. Poorly constructed test items;
5. Ambiguity;
6. Test items inappropriate for the outcome being measured;
7. Inadequate time limits;
8. Test too short;
9. Improper arrangement of items;
10. Identifiable pattern of answers.

Reliability
The concept of reliability is closely related to (and often confused with) validity. A reliable method of
evaluation will produce similar results (within certain limitations) for the same student across time and
circumstances. Reliability can be defined as the degree of consistency between two measures of
the same thing. It:
 Provides the consistency that makes validity possible.
 Indicates how much confidence we can place in our results.
The concept of reliability as applied to testing and evaluation can be clarified by noting the
following general points.
1. Reliability refers to the results (test scores) obtained with an evaluation instrument and not
to the instrument itself.
2. Estimates of reliability always refer to a particular type of consistency. Test scores are not
reliable in general. They are reliable or generalizable over different
 periods of time
 samples of questions
 raters
3. Reliability is a necessary but not sufficient condition for validity. A test that produces
totally inconsistent results cannot possibly provide valid information about the
performance being measured. Low reliability can be expected to restrict the degree of
validity that is obtained, but high reliability does not ensure that a satisfactory degree of
validity will be present.
4. Reliability is primarily statistical. Shifts in the relative standing of students in the group,
or the amount of variation to be expected in an individual's score, are reported by
means of a reliability coefficient or the standard error of measurement (see the sketch below).
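As a minimal sketch, assuming the commonly used formula SEM = s × √(1 − r), where s is the standard deviation of the test scores and r is the reliability coefficient (the values below are invented):

import math

def standard_error_of_measurement(sd, reliability):
    # SEM = s * sqrt(1 - r): expected spread of an individual's obtained
    # scores around his or her "true" score
    return sd * math.sqrt(1 - reliability)

print(round(standard_error_of_measurement(sd=4.80, reliability=0.91), 2))  # 1.44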
