Alyssa Louise C. Cabije
BS Psychology, University of Mindanao

CHAPTER 1:

PSYCHOLOGICAL TESTING AND ASSESSMENT

I. TESTING AND ASSESSMENT

a. Roots can be found in early twentieth century France, 1905
b. Alfred Binet published a test designed to help place Paris schoolchildren
c. WWI: the military used tests to screen large numbers of recruits quickly for intellectual
and emotional problems
d. WWII: the military depended even more on tests to screen recruits for service

PSYCHOLOGICAL ASSESSMENT vs. PSYCHOLOGICAL TESTING

DEFINITION
Assessment: the gathering and integration of psychology-related data for the purpose of making a
psychological evaluation, accomplished through the use of tools
Testing: the process of measuring psychology-related variables by means of devices/procedures
designed to obtain a sample of behavior
OBJECTIVE
Assessment: to answer a referral question, solve a problem, or arrive at a decision through the
use of tools of evaluation
Testing: to obtain some gauge, usually numerical in nature
PROCESS
Assessment: typically individualized
Testing: may be individualized or group
ROLE OF EVALUATOR
Assessment: the assessor is key in the process of selecting tests as well as in drawing conclusions
Testing: the tester is not key to the process and may be substituted
SKILL OF EVALUATOR
Assessment: typically requires an educated selection of tools of evaluation and skill in evaluation
Testing: requires technician-like skills
OUTCOME
Assessment: entails a logical problem-solving approach to answer the referral question
Testing: typically yields a test score

e. 3 FORMS OF ASSESSMENT
i. COLLABORATIVE PSYCHOLOGICAL ASSESSMENT: assessor and assessee work as partners from initial contact
through final feedback
ii. THERAPEUTIC PSYCHOLOGICAL ASSESSMENT: self-discovery and new understandings are encouraged
throughout the assessment process
iii. DYNAMIC PSYCHOLOGICAL ASSESSMENT
1. follows a model of (1) evaluation, (2) intervention, (3) evaluation
2. Provides a means for evaluating how the assessee processes or benefits from some type of intervention
during the course of evaluation.
f. TOOLS OF PSYCHOLOGICAL ASSESSMENT
i. The Test (a measuring device or procedure)
1. Psychological test
a. a device or procedure designed to measure variables related to psychology (intelligence,
personality, aptitude, interests, attitudes, or values)
2. Format: refers to the form, plan, structure, arrangement, and layout of test items as well as to related
considerations such as time limits.
a. also referred to as the form in which a test is administered (pen and paper, computer, etc)
Computers can generate scenarios
b. term is also used to denote the form or structure of other evaluative tools, and processes, such
as the guidelines for creating a portfolio work sample
3. Ways that tests differ from one another:
a. Administration procedures
i. some tests require an active, knowledgeable test administrator
1. administration may involve demonstration of tasks
2. usually one-on-one
3. trained observation of the assessee’s performance
ii. for other tests, the administrator does not even have to be present
1. usually administered to larger groups
2. test takers complete tasks independently
b. Scoring and interpretation procedures

i. Score: a code or summary statement, usually (but not necessarily) numerical in nature, that
reflects an evaluation of performance on a test, task, interview, or some other sample of
behaviour
ii. Scoring: process of assigning such evaluative codes/statements to performance on tests,
tasks, interviews, or other behavior samples.
iii. Different types of score:
1. Cut score: reference point, usually numerical, derived by judgement and used to
divide a set of data into two or more classifications
a. sometimes reached without any formal method, as when teachers simply “eyeball” the scores
and decide what is passing and what is failing.
iv. Who scores it
1. self-scored by testtaker
2. computer
3. trained examiner
c. Psychometric soundness/ technical quality
i. Psychometrics: the science of psychological measurement
1. referring to how consistently and how accurately a psychological test measures what
it purports to measure.
ii. Utility: refers to the usefulness or practical value that a test or other tool of assessment has
for a particular purpose.

ii. The Interview: method of gathering information through direct communication involving reciprocal
exchange
1. Interviewer in face-to-face is taking note of:
a. verbal language
b. nonverbal language
i. body language movements
ii. facial expressions in response to interviewer
iii. the extent of eye contact
iv. apparent willingness to cooperate
c. how they are dressed
i. neat vs sloppy vs inappropriate
2. Interviewer over the phone taking note of :
1. changes in the interviewee’s voice pitch
2. long pauses
3. signs of emotion in response
3. Ways that interviews differ:
i. length, purpose, and nature
ii. used to help make diagnostic, treatment, selection, and other decisions
4. Panel interview
a. an interview conducted with one interviewee with more than one interviewer
iii. The Portfolio
1. Files of work products: paper, canvas, film, video, audio, etc
2. Samples of one’s abilities and accomplishments
iv. Case History Data: records, transcripts, and other accounts in written, pictorial, or other form that preserve
archival information, official and informal accounts, and other data and items relevant to an assessee
1. Sheds light on an individual's past and current adjustment as well as on events and circumstances that
may have contributed to any changes in adjustment
2. Provides information about neuropsychological functioning prior to the occurrence of a trauma or
other event that results in a deficit
3. Insight into current academic and behavioral standing
4. Useful in making judgments for future class placements
5. Case history study: a report or illustrative account concerning a person or an event that was compiled
on the basis of case history data
a. might shed light on how one individual’s personality and particular set of environmental
conditions combined to produce a successful world leader.
b. Groupthink: work on this social psychological phenomenon contains rich case history material on
collective decision making that did not always result in the best decisions.

v. Behavioral Observation: monitoring the actions of others or oneself by visual or electronic means while
recording quantitative and/or qualitative information regarding those actions.
1. Often used as a diagnostic aid in various settings: inpatient facilities, behavioral research laboratories,
classrooms.
2. Naturalistic observation: behavioral observation that takes place in a naturally occurring setting (as
opposed to a research laboratory) for the purpose of evaluation and information-gathering.
3. In practice, it tends to be used most frequently by researchers in settings such as classrooms, clinics,
prisons, etc.
vi. Role- Play Tests
1. Role play: acting an improvised or partially improvised part in a simulated situation.
2. Role-play test: tool of assessment wherein assessees are directed to act as if they were in a particular
situation. Assessees are then evaluated with regard to their expressed thoughts, behaviors, abilities,
etc
vii. Computers as tools
1. Local processing: on site computerized scoring, interpretation, or other conversion of raw test data;
contrast w/ CP and teleprocessing
2. Central processing: computerized scoring, interpretation, or other conversion of raw data that is
physically transported from the same or other test sites; contrast w/ LP and teleprocessing.
3. Teleprocessing: computerized scoring, interpretation, or other conversion of raw test data sent over
telephone lines by modem from a test site to a central location for computer processing. contrast
with CP and LP
4. Simple score report: a type of scoring report that provides only a listing of scores
5. Extended scoring report: a type of scoring report that provides a listing of scores AND statistical data
6. Interpretive report: a formal or official computer-generated account of test performance presented in
both numeric and narrative form and including an explanation of the findings;
a. The three varieties of interpretive report are:
i. Descriptive
ii. Screening
iii. Consultive
b. Some contain relatively little interpretation and simply call attention to certain high, low, or
unusual scores that may need to be focused on.
c. Consultative report: A type of interpretive report designed to provide expert and detailed
analysis of test data that mimics the work of an expert consultant.
d. Integrative report: a form of interpretive report of psychological assessment, usually computer-
generated, in which data from behavioral, medical, administrative, and/or other sources are
integrated
7. CAPA: computer assisted psychological assessment. (assistance to the test user not the test taker)
a. Enables test developers to create psychometrically sound tests using complex mathematical
procedures and calculations.
b. Enables test users to construct tailor-made tests with built-in scoring and interpretive
capabilities
c. Pros:
i. test administrators have greater access to potential test users because of the global reach
of the internet.
ii. scoring and interpretation of test data tend to be quicker than for paper-and-pencil tests
iii. costs associated with internet testing tend to be lower than costs associated with paper-
and-pencil tests
iv. the internet facilitates the testing of otherwise isolated populations, as well as people with
disabilities for whom getting to a test center might prove a hardship
v. greener: conserves paper, shipping materials etc.

d. Cons:
i. test client integrity
1. refers to the verification of the identity of the test taker when a test is administered
online
2. also refers to the sometimes varying interests of the test taker vs that of the test
administrator. The test taker might have access to notes, aids, internet resources etc.
3. internet testing is only testing, not assessment

8. CAT: computerized adaptive testing: an interactive, computer-administered test-taking process
wherein items presented to the test taker are based in part on the test taker's performance on
previous items
a. EX: on a computerized test of academic abilities, the computer might be programmed to switch
from testing math skills to English skills after three consecutive failures on math items (see the
sketch after this list).
9. Other Tools
a. DVD- how would you respond to the events that take place in the video
i. sexual harassment in the workplace
ii. respond to various types of emergencies
iii. diagnosis/treatment plan for clients on videotape
b. thermometers, biofeedback, etc
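
A minimal Python sketch of the adaptive-switching idea in the CAT example above. The item pools, the ask() callback, and the three-failure rule are illustrative assumptions, not part of any published adaptive-testing algorithm.

# Hypothetical item pools; real CAT systems draw from calibrated item banks.
math_items = ["7 x 8 = ?", "Solve 2x + 3 = 11", "What is 15% of 60?"]
english_items = ["Choose the synonym of 'rapid'", "Identify the verb in: 'Birds fly.'"]

def administer(items, ask, max_consecutive_failures=3):
    """Present items until the pool runs out or the failure rule triggers."""
    consecutive_failures = 0
    results = []
    for item in items:
        correct = ask(item)  # ask() is an assumed callback returning True if answered correctly
        results.append((item, correct))
        consecutive_failures = 0 if correct else consecutive_failures + 1
        if consecutive_failures >= max_consecutive_failures:
            break  # stop this domain, mirroring the "three consecutive failures" example
    return results

def run_cat(ask):
    results = administer(math_items, ask)       # begin with math items
    results += administer(english_items, ask)   # then switch to English items
    return results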
II. TEST DEVELOPER
a. They are the ones who create tests.
b. They conceive, prepare, and develop tests. They also find a way to disseminate their tests, by publishing them
either commercially or through professional publications such as books or periodicals.
III. TEST USER
a. They select or decide to take a specific test off the shelf and use it for some purpose.
b. They may also participate in other roles, e.g., as examiners or scorers.
IV. TEST TAKER
a. Anyone who is the subject of an assessment
b. Test taker may vary on a continuum with respect to numerous variables including:
i. The amount of anxiety they experience & the degree to which the test anxiety might affect the results
ii. The extent to which they understand & agree with the rationale of the assessment
iii. Their capacity & willingness to cooperate
iv. Amount of physical pain/emotional distress they are experiencing
v. Amount of physical discomfort
vi. Extent to which they are alert & wide awake
vii. Extent to which they are predisposed to agreeing or disagreeing when presented with stimulus
viii. The extent to which they have received prior coaching
ix. The importance they attribute to portraying themselves in a good light
c. Psychological autopsy – reconstruction of a deceased individual’s psychological profile on the basis of archival
records, artifacts, & interviews previously conducted with the deceased assessee
V. TYPES OF SETTINGS
a. EDUCATIONAL SETTING
i. Achievement test: evaluation of accomplishments or the degree of learning that has taken place, usually
with regard to an academic area.
ii. Diagnosis: a description or conclusion reached on the basis of evidence and opinion through a process of
distinguishing the nature of something and ruling out alternative conclusions
iii. Diagnostic test: a tool used to make a diagnosis, usually to identify areas of deficit to be targeted for
intervention
iv. Informal evaluation: a typically nonsystematic, relatively brief, and “off the record” assessment leading to
the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial
context and not subject to the same ethics or standards as evaluation by a professional
b. CLINICAL SETTING
i. these tools are used to help screen for or diagnose behavior problems
ii. group testing is used primarily for screening: identifying those individuals who require further diagnostic
evaluation.
c. COUNSELING SETTING
i. Schools, prisons, and governmental or privately owned institutions
ii. ultimate objective: the improvement of the assessee in terms of adjustment, productivity, or some related
variable.
d. GERIATRIC SETTING
i. Quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness,
sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other
social support.
e. BUSINESS AND MILITARY SETTINGS
f. GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING
i. How are Assessments Conducted?

1. protocol: the form or sheet or booklet on which a testtaker’s responses are entered
a. term might also be used to refer to a description of a set of test- or assessment- related
procedures, as in the sentence, “the examiner dutifully followed the complete protocol for the
stress interview”
2. rapport: working relationship between the examiner and the examinee
g. ASSESSMENT OF PEOPLE WITH DISABILITIES
i. Define who requires alternate assessment, how such assessments are to be conducted, and how meaningful
inferences are to be drawn from the data derived from such assessments
ii. Accommodation – adaptation of a test, procedure, or situation, or the substitution of one test for another, to
make the assessment more suitable for an assessee with exceptional needs.
iii. Example: translating a test into Braille and administering it in that form.
iv. Alternate assessment – evaluative or diagnostic procedure or process that varies from the usual,
customary, or standardized way a measurement is derived, either by virtue of some special accommodation
made to the assessee or by means of alternative methods
v. Consider these four variables on which of many different types of accommodation should be employed:
1. The capabilities of the assessee
2. The purpose of the assessment
3. The meaning attached to test scores
4. The capabilities of the assessor
VI. REFERENCE SOURCES
a. TEST CATALOGUES – contain brief descriptions of the test
b. TEST MANUALS – detailed information
c. REFERENCE VOLUMES – one stop shopping, provides detailed information for each test listed, including test
publisher, author, purpose, intended test population and test administration time
d. JOURNAL ARTICLES – contain reviews of the test
e. ONLINE DATABASES – most widely used bibliographic databases
VII. TYPES OF TESTS
a. INDIVIDUAL TEST – those given to only one person at a time
b. GROUP TEST – administered to more than one person at a time by single examiner
c. ABILITY TESTS
i. ACHIEVEMENT TESTS – refers to previous learning (ex. Spelling)
ii. APTITUDE/PROGNOSTIC – refers to the potential for learning or acquiring a specific skill
iii. INTELLIGENCE TESTS – refers to a person’s general potential to solve problems
d. PERSONALITY TESTS: refers to overt and covert dispositions
i. OBJECTIVE/STRUCTURED TESTS – usually self-report, require the subject to choose between two or more
alternative responses
ii. PROJECTIVE/UNSTRUCTURED TESTS – present an ambiguous stimulus and/or unclear response requirements;
the testtaker is assumed to project his or her own characteristics onto the response
iii. INTEREST TESTS

CHAPTER 2:
HISTORICAL, CULTURAL AND LEGAL/ETHICAL CONSIDERATIONS

I. A HISTORICAL PERSPECTIVE
a. 19TH CENTURY
i. Tests and testing programs first came into being in China
ii. Testing was instituted as a means of selecting who, of many applicants, would obtain government jobs (civil
service)
iii. Job applicants were tested on proficiency in endeavors such as music, archery, and other areas of knowledge
and skill
b. GRECO-ROMAN WRITINGS (Middle Ages)
i. Deviant behavior was attributed to evil forces (a “world of evilness”)
ii. A deficiency in some bodily fluid was a factor believed to influence personality
iii. Hippocrates and Galen
c. RENAISSANCE
i. Christian von Wolff – anticipated psychology as a science and psychological measurement as a specialty
within that science
d. CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
i. Tests designed to measure these individual differences in ability and personality among people
ii. “Origin of Species”
iii. chance variation in species would be selected or rejected by nature according to adaptivity and survival
value. “survival of the fittest”
e. FRANCIS GALTON
i. Explore and quantify individual differences between people.
ii. Classify people “according to their natural gifts”
iii. Displayed the first anthropometric laboratory
f. KARL PEARSON
i. Developed the product moment correlation technique.
ii. His work can be traced directly from Galton
g. WILHELM MAX WUNDT
i. Founded the first experimental psychology laboratory, at the University of Leipzig
ii. Focused more on how people were similar to one another, not on how they differed
h. JAMES MCKEEN CATTELL
i. Individual differences in reaction time
ii. Coined the term mental test
i. CHARLES SPEARMAN
i. Originating the concept of test reliability as well as building the mathematical framework for the statistical
technique of factor analysis
j. VICTOR HENRI
i. Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure
higher mental processes
k. EMIL KRAEPELIN
i. Early experimenter of word association technique as a formal test
l. LIGHTNER WITMER
i. “Little known founder of clinical psychology”
ii. Founded the first psychological clinic in the U.S.
m. PSYCHE CATTELL
i. Daughter of James McKeen Cattell
ii. Cattell Infant Intelligence Scale (CIIS) & Measurement of Intelligence in Infants and Young Children
n. RAYMOND CATTELL
i. Believed in lexical approach to defining personality which examines human languages for descriptors of
personality dimensions

o. 20th CENTURY
i. Birth of the first formal tests of intelligence
ii. Testing shifted to be of more understandable relevance/meaning
iii. THE MEASUREMENT OF INTELLIGENCE
1. Binet created the first intelligence test, designed to identify mentally retarded schoolchildren in Paris (individually administered)

2. The Binet-Simon Test has been revised again and again. Group intelligence tests emerged with the need to screen
the intellect of WWI recruits
3. David Wechsler – designed a test to measure adult intelligence
a. for him Intelligence is a global capacity of the individual to act purposefully, to think
rationally and to deal effectively with his environment.
b. Wechsler-Bellevue Intelligence Scale
c. Wechsler Adult Intelligence Scale – was revised several times; the Wechsler series eventually extended the age
range of testtakers from young children through senior adulthood.
iv. THE MEASUREMENT OF PERSONALITY
1. The field of psychology was becoming too test oriented
2. Clinical psychology was synonymous with mental testing
3. ROBERT WOODWORTH – develop a measure of adjustment and emotional stability that could be
administered quickly and efficiently to groups of recruits
a. To disguise the true purpose of the test, questionnaire was labeled as Personal Data Sheet
b. He called it Woodworth Psychoneurotic Inventory – first widely used self-report test of
personality
4. Self-report test:
a. Advantages:
i. Respondents best qualified
b. Disadvantages:
i. Poor insight into self
ii. One might honestly believe something about self that isn’t true
iii. Unwillingness to report seemingly negative qualities
5. Projective test:
a. individual is assumed to project onto some ambiguous stimulus (inkblot, photo, etc.) his or
her own unique needs, fears, hopes, and motivations
b. Ex.) Rorschach inkblot test
v. THE ACADEMIC AND APPLIED TRADITIONS

p. Culture and Assessment


i. Culture: ‘the socially transmitted behavior patterns, beliefs, and products of work of a particular population,
community, or group of people
ii. Evolving Interest in Culture-Related Issues
1. Goddard tested immigrants and found most to be feebleminded
2. invalid; overestimated mental deficiency, even in native English-speakers
3. Led to nature-nurture debate about what intelligence tests actually measure
4. Needed to “isolate” the cultural variable
iii. Culture-specific tests:
1. tests designed for use with people from one culture, but not from another
2. minorities still scored abnormally low
3. ex.) loaf of bread vs. tortillas
4. today tests undergo many steps to ensure they are suitable for the nation or culture in which they will be used
a. take testtakers’ reactions into account
iv. Some Issues Regarding Culture and Assessment
1. Verbal Communication
a. Examiner and examinee must speak the same language
b. Especially tricky when infrequently used vocabulary or unusual idioms are employed
c. A translator may lose nuances in translation or give unintentional hints toward the more
desirable answer
d. Also requires understanding of culture
2. Nonverbal Communication and Behavior
a. Different between cultures
b. Ex.) meaning of not making eye contact
c. Body movement could even have physical cause
d. Psychoanalysis: Freud’s theory of personality and psychological treatment which
stated that symbolic significance is assigned to many nonverbal acts.
e. Timing tests in cultures not obsessed with speed
f. Lack of speaking could be reverence for elders

3. Standards of Evaluation
a. Acceptable roles for women differ across cultures
b. “judgments as to who might be the best employee, manager, or leader may differ as
a function of culture, as might judgments regarding intelligence, wisdom, courage, and other
psychological variables”
c. must ask ‘how appropriate are the norms or other standards that will be used to
make this evaluation
q. Tests and Group Membership
i. ex.) must be 5’4” to be a police officer - excludes cultures with short stature
ii. ex.) Jewish lifestyle not well suited for corporate America
iii. affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal
opportunity in education and employment for all
iv. Psychology, tests, and public policy

II. Legal and Ethical Considerations


Code of professional ethics: defines the standard of care expected of members of a given profession.
a. The Concerns of the Public
i. Beginning in World War I, fear that tests were only testing the ability to take tests
ii. Legislation
1. Minimum competency testing programs: formal testing programs designed to be used in
decisions regarding various aspects of students’ educations
2. Truth-in-testing legislation: state laws to provide testtakers with a means of learning the criteria by
which they are being judged
iii. Litigation
1. Daubert ruling made federal judges the gatekeepers to determining what expert testimony is
admitted
2. This overrode the Frye policy, which only admitted scientific testimony that had won general
acceptance in the scientific community.
b. The Concerns of the Profession
i. Test-user qualifications
Who should be allowed to use psych tests?
1. Level A: tests or aids that can adequately be administered, scored, and interpreted with the aid of the
manual and a general orientation to the kind of institution or organization in which one is
working
2. Level B: tests or aids that require some technical knowledge of test construction and use and
of supporting psychological and educational fields
3. Level C: tests and aids requiring substantial understanding of testing and supporting psych fields with
experience
ii. Testing people with disabilities
1. Difficulty in transforming the test into a form that can be taken by testtaker
2. Transferring responses to be scorable
3. Meaningfully interpreting the test data
iii. Computerized test administration, scoring, and interpretation
1. simple, convenient
2. easily copied, duplicated
3. insufficient research to compare it to pencil-and-paper versions
4. value of computer interpretation is questionable
5. unprofessional, unregulated “psychological testing” online
c. The Rights of Testtakers
i. the right of informed consent
ii. right to know why they are being evaluated, how test data will be used and what information will
be released to whom
iii. may be obtained by parent or legal representative
iv. must be in written form:
1. general purpose of the testing
2. the specific reason it is being undertaken
3. general type of instruments to be administered
v. revealing this information before the test can contaminate the results

vi. deception only used if absolutely necessary
vii. don’t use deception if it will cause emotional distress
viii. fully debrief participants
d. The right to be informed of test findings
i. Formerly, test administrators were told to give testtakers only positive information, to tell them as little as
possible about the nature of their performance on a particular test, and to leave the examinee feeling pleased
and satisfied but with no realistic information
ii. Today, testtakers have the right to be informed of test findings in language they can understand
iv. Test takers have the right also to know what recommendations are being made as a consequence
of the test data
e. The right to privacy and confidentiality
i. Private right: “recognizes the freedom of the individual to pick and choose for himself the time,
circumstances, and particularly the extent to which he wishes to share or withhold from others his
attitudes, beliefs, behaviors, and opinions”
ii. Privileged information: information protected by law from being disclosed in a legal proceeding. Protects
clients from disclosure in judicial proceedings. Privilege belongs to the client, not the psychologist.
iii. Confidentiality: concerns matters of communication outside the courtroom
1. Safekeeping of test data: It is not a good policy to maintain all records in perpetuity
f. The right to the least stigmatizing label
i. The standards advise that the least stigmatizing labels should always be assigned when reporting
test results.

CHAPTER 3:
A STATISTICS REFRESHER
I. Why We Need Statistics
a. Statistics are important for purposes of education
i. Numbers provide convenient summaries and allow us to evaluate some observations relative to others
b. We use statistics to make inferences, which are logical deductions about events that cannot be observed directly
i. Detective work of gathering and displaying clues – exploratory data analysis
ii. Then confirmatory data analysis
c. Descriptive statistics: methods used to provide a concise description of a collection of quantitative information
d. Inferential statistics: methods used to make inferences from observations of a small group of people known as a
sample to a larger group of individuals known as a population
II. SCALES OF MEASUREMENT
a. MEASUREMENT – the act of assigning numbers or symbols to characteristics of things according to rules. The rules
serve as a guideline for representing the magnitude of the attribute. Measurement always involves error.
b. SCALE – a set of numbers whose properties model empirical properties of the objects to which the numbers are
assigned.
c. CONTINUOUS SCALE – interval/ratio; used to measure a continuous variable. Always involves error.
d. DISCRETE SCALE – nominal/ordinal; used to measure a discrete variable (ex. female or male)
e. ERROR – the collective influence of all of the factors on a test score or measurement beyond those specifically measured
III. PROPERTIES OF SCALES
a. Magnitude
i. The property of “moreness”
ii. A scale has the property of magnitude if we can say that a particular instance of the attribute represents
more, less, or equal amounts of the given quantity than does another instance
b. Equal Intervals
i. A scale has the property of equal intervals if the difference between two points at any place on the scale
has the same meaning as the difference between two other points that differ by the same number of scale
units
ii. A psychological test rarely has the property of equal intervals
iii. When a scale has the property of equal intervals, the relationship between the measured units and some
outcome can be described by a straight line or a linear equation in the form Y=a+bX
1. Shows that an increase in equal units on a given scale reflects equal increases in the meaningful
correlates of units
c. Absolute 0
i. An Absolute 0 is obtained when nothing of the property being measured exists
ii. This is extremely difficult/impossible for many psychological qualities
IV. TYPES OF SCALES
i. NOMINAL SCALE
1. Simplest form of measurement
2. Classification or categorization
3. Arithmetic operations can be performed with nominal data
4. Ex.) Male or female
5. Also includes test items
a. Ex.) yes/no responses
ii. ORDINAL SCALE
1. Classifies in some kind of ranking order
2. Individuals compared to others and assigned a rank
3. Imply nothing about how much greater one ranking is than another
4. Numbers/ranks do not indicate units of measure
5. No absolute zero point
6. Binet: believed that data derived from intelligence tests are ordinal in nature
iii. INTERVAL SCALE
1. In addition to the features of nominal and ordinal scales, contain equal intervals between numbers
2. No absolute zero point
3. Can take average
iv. RATIO SCALE
1. In addition to all the properties of nominal, ordinal, and interval measurement, ratio scale has true
zero point

2. Equal intervals between numbers
3. Ex.) measuring amount of pressure hand can exert
4. True zero doesn’t mean someone will receive a score of 0, but means that 0 has meaning
5. NOTE:
a. Permissible Operations
i. Level of measurement is important because it defines which mathematical operations we
can apply to numerical data
ii. For nominal data, each observation can be placed in only one mutually exclusive category
iii. Ordinal measurements can be manipulated using arithmetic, though the result is often difficult to interpret
iv. With interval data, one can apply any arithmetic operation to the differences between
scores
1. Cannot be used to make statements about ratios
V. DESCRIBING DATA
a. Raw Score: straightforward, unmodified accounting of performance, usually numerical
b. Distribution: set of scores arrayed for recording or study
i. Frequency Distributions
1. Frequency Distribution: All scores listed alongside the number of times each score occurred
2. Grouped Frequency Distribution: test-score intervals (class intervals), replace the actual test scores
a. Highest and lowest class intervals= upper and lower limits of distribution
3. Histogram: graph with vertical lines drawn at the true limits of each test score (or class interval)
forming TOUCHING rectangles- midpoint in center of bar
4. Bar Graph: rectangles DON’T touch
ii. Frequency Polygon: data illustrated with continuous line connecting the points where test scores or class
intervals meet frequencies
iii. A single test score means more if one relates it to other test scores
iv. A distribution of scores summarizes the scores for a group of individuals
v. Frequency distribution: displays scores on a variable or a measure to reflect how frequently each value was
obtained
1. One defines all the possible scores and determines how many people obtained each of those scores
vi. Income is an example of a variable that has a positive skew
vii. Whenever you draw a frequency distribution or a frequency polygon, you must decide on the width of the
class interval
viii. Class interval: the unit on the horizontal axis of a frequency distribution (e.g., inches of rainfall)
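
A small Python sketch of a simple and a grouped frequency distribution; the scores and the class-interval width are invented for illustration.

from collections import Counter

scores = [44, 47, 47, 50, 52, 52, 52, 55, 61, 61, 63, 67, 72, 72, 78]

# Simple frequency distribution: each score listed with the number of times it occurred.
frequency = Counter(scores)

# Grouped frequency distribution: class intervals replace the actual test scores.
width = 10
grouped = Counter((s // width) * width for s in scores)

for lower in sorted(grouped):
    print(f"{lower}-{lower + width - 1}: {grouped[lower]}")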

c. Measures of Central Tendency


i. Measure of central tendency: statistic that indicates the average or midmost score between the extreme
scores in a distribution.
ii. The Arithmetic Mean
1. “X bar”
2. sum of observations divided by number of observations
3. Mean = Σ(X)/n
4. Used for interval or ratio data when distributions are relatively normal
iii. The Median
1. The middle score
2. Used for ordinal, interval, and ratio data
3. Especially useful when few scores fall at extremes
iv. The Mode
1. Most frequently-occurring score
2. Bimodal distribution- 2 scores both have highest frequency
3. Only common with nominal data
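
A brief sketch of the three measures using Python's statistics module; the score list is made up.

import statistics

scores = [10, 12, 12, 15, 15, 15, 18, 21, 26]

print(statistics.mean(scores))    # arithmetic mean: sum of observations / number of observations
print(statistics.median(scores))  # median: the middle score
print(statistics.mode(scores))    # mode: the most frequently occurring score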

d. Measures of Variability
i. Variability: indication of how scores in a distribution are scattered or dispersed
ii. The Range
1. Difference between the highest and lowest scores
2. Quick but gross description of the spread of scores
iii. The interquartile and semi-interquartile range
1. Distribution is split up by 3 quartiles, thus making 4 quarters each representing 25% of the scores

2. Q2= median
3. Interquartile range: a measure of variability equal to the difference between Q3 and Q1
4. Semi-interquartile range: the interquartile range divided by 2
iv. Quartiles and Deciles
1. Quartiles are points that divide the frequency distribution into equal fourths
2. First quartile is the 25th percentile; second quartile is the median, or 50th percentile; third quartile is
the 75th percentile
3. The interquartile range is bounded by the range of scores that represents the middle 50% of the
distribution
4. Deciles are similar but use points that mark 10% rather than 25% intervals
5. Stanine system: converts any set of scores into a transformed scale, which ranges from 1 to 9
v. The average deviation
1. deviation score: x = X - mean
2. Average deviation = Σ|x| / n (the sum of the absolute values of the deviation scores divided by the total number of scores)
3. Tells us on average how far scores are from the mean
vi. The Standard Deviation
1. Similar to average deviation
2. But in order to overcome the sign (+/-) problem, each deviation is squared
3. Standard deviation: a measure of variability equal to the square root of the average squared
deviations about the mean
4. Is square root of variance
5. Variance: the mean of the squares of the difference b/w the scores in a distribution and their mean
a. Found by squaring and summing all the deviation scores and then dividing by the total number
of scores
6. s = sample standard deviation
7. sigma = population standard deviation
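
A short sketch, with invented scores, of the range, interquartile and semi-interquartile range, variance, and standard deviation defined above; the population form of the variance is used here.

import statistics

scores = [4, 6, 7, 7, 8, 9, 10, 12, 13, 15]

score_range = max(scores) - min(scores)            # highest score minus lowest score

q1, q2, q3 = statistics.quantiles(scores, n=4)     # quartile points (Q2 is the median)
iqr = q3 - q1                                      # interquartile range
semi_iqr = iqr / 2                                 # semi-interquartile range

mean = statistics.mean(scores)
deviations = [x - mean for x in scores]
variance = sum(d ** 2 for d in deviations) / len(scores)  # mean of the squared deviations
sd = variance ** 0.5                                       # square root of the variance

print(score_range, iqr, semi_iqr, variance, sd)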
VI. SKEWNESS
a. Skewness: nature and extent to which symmetry is absent
b. POSITIVE SKEWED
i. Ex.) test was too hard
c. NEGATIVELY SKEWED
i. ex.) test was too easy
ii. skewness can be gauged by examining the relative distances of the quartiles from the median
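
One way to turn that quartile check into a number is Bowley's quartile skewness coefficient; using it here is my own illustrative choice, and the scores are invented.

import statistics

scores = [2, 3, 3, 4, 4, 5, 6, 8, 11, 15]   # a few high scores stretch the right tail

q1, q2, q3 = statistics.quantiles(scores, n=4)

# Positive value: Q3 lies farther from the median than Q1, suggesting positive skew;
# a negative value suggests negative skew; near zero suggests rough symmetry.
quartile_skew = ((q3 - q2) - (q2 - q1)) / (q3 - q1)
print(quartile_skew)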
VII. KURTOSIS
a. steepness of distribution
b. platykurtic: relatively flat
c. leptokurtic: relatively peaked
d. mesokurtic: somewhere in the middle
VIII. THE NORMAL CURVE
a. Normal curve: bell-shaped, smooth, mathematically defined curve, highest at center; both sides taper as it
approaches the x-axis asymptotically
i. symmetrical; the mean, median, and mode are all the same
b. Area under the Normal Curve
i. Tails and body
IX. STANDARD SCORES
a. Standard Score: raw score that has been converted from one scale to another scale, where the latter has an
arbitrarily set mean and standard deviation; used for comparison
b. Z-score
i. conversion of a raw score into a number indicating how many standard deviation units the raw score is
below or above the mean of the distribution.
ii. The difference between a particular raw score and the mean divided by the standard deviation
iii. Used to compare test scores from different scales (see the sketch at the end of this section)
c. T-score
i. Standard score system composed of a scale that ranges from 5 standard deviations below the mean to 5
standard deviations above the mean, with a mean of 50 and a standard deviation of 10 (T = 50 + 10z)
ii. No negatives
d. Other Standard Scores
i. SAT

ii. GRE
iii. Linear transformation: when a standard score retains a direct numerical relationship to the original raw
score
iv. Nonlinear transformation: required when data are not normally distributed, yet comparisons with normal
distributions need to be made
1. Normalized Standard Scores
a. When scores don’t fall on normal distribution
b. “normalizing a distribution involves ‘stretching’ the skewed curve into the shape of a normal
curve and creating a corresponding scale of standard scores, a scale called a normalized standard
score scale”
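
A minimal sketch, with invented raw scores, of converting a raw score to a z-score and then to a T-score as described in this section.

import statistics

raw_scores = [38, 42, 45, 47, 50, 53, 55, 58, 61]

mean = statistics.mean(raw_scores)
sd = statistics.pstdev(raw_scores)   # population standard deviation of the distribution

def z_score(x):
    """Number of standard deviation units x lies above or below the mean."""
    return (x - mean) / sd

def t_score(x):
    """Linear transformation of z to a scale with a mean of 50 and a standard deviation of 10."""
    return 50 + 10 * z_score(x)

print(z_score(61), t_score(61))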

CHAPTER 4:
OF TESTS AND TESTING

I. Some Assumptions About Psychological Testing and Assessment


a. Assumption 1: Psychological Traits and States Exist
i. Trait: any distinguishable, relatively enduring way in which one individual varies from another
ii. States: distinguish one person from another but are relatively less enduring
1. Trait term that an observer applies, as well as strength or magnitude of the trait presumed present
a. based on observing a sample of behavior
2. Trait and state definitions also refer to individual variation
3. Make comparisons with respect to the hypothetical average person
4. Samples of behavior:
a. Direct observation
b. Analysis of self-report statements
c. Paper-and-pencil test answers
5. Psychological trait: covers wide range of possible characteristics
a. ex: Intelligence
b. Specific intellectual abilities
c. Cognitive style
d. Psychopathology
6. Controversy regarding how psychological traits exist
7. Psychological traits exist only as constructs: an informed, scientific concept developed or constructed
to describe or explain behavior
a. We cannot see, hear, or touch constructs; we infer their existence from overt behavior: an observable action or
the product of an observable action, including test- or assessment-related responses
8. Traits not expected to be manifested in behavior 100% of the time
a. There seems to be rank-order stability in personality traits: relatively high correlations between trait
scores at different time points
9. Whether and to what degree a trait manifests itself is dependent on the strength and nature of the
situation
b. Assumption 2: Psychological Traits and States Can Be Quantified and Measured
i. After acknowledging that psychological traits and states exist, the specific traits and states to be
measured need to be defined
1. What types of behaviors are assumed to be indicative of the trait?
2. Test developer has to provide test users with a clear operational definition of the construct under
study
ii. After being defined, test developer considers types of item content that would provide insight into it
1. Ex: behaviors that are indicative of a particular trait
iii. Should all questions be weighted the same?
1. Weighting the comparative value of a test’s items comes about as the result of a complex interplay
among many factors:
a. Technical considerations
b. The way a construct has been defined (for particular test)
c. Value society (and test developer) attach to behaviors evaluated
iv. Need to find appropriate ways to score the test and interpret results
1. Cumulative scoring: test score is presumed to represent the strength of the targeted ability or trait or
state
a. The more the testtaker responds in a particular direction (as keyed by test manual) the higher
the testtaker is presumed to possess the targeted trait or ability
c. Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
i. Objective of test is to provide some indication of some aspects of the examinee’s behavior
1. Tasks on some tests mimic the actual behaviors that the test user is attempting to understand
ii. Obtained behavior is usually used to predict future behavior
iii. Could also be used to postdict behaviour -- to aid in the understanding of behavior that has already taken
place
iv. Tools of assessment, such as a diary, or case history data, might be of great value in such an evaluation
d. Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
i. Competent test users understand a lot about the tests they use

1. How it was developed
2. Circumstances under which it is appropriate to administer the test
3. How test should be administered and to whom
4. How results should be interpreted
ii. Understand and appreciate the limitations of the tests they use
e. Assumption 5: Various Sources of Error Are Part of the Assessment Process
i. In everyday language, error refers to mistakes and miscalculations
ii. In assessment, error refers to the long-standing assumption that factors other than what a test attempts to
measure will influence performance on the test
iii. Error variance: component of a test score attributable to sources other than the trait or ability measured
1. Assessees themselves are sources of error variance
iv. Classical test theory (CTT)/ True score theory: assumption is made that each testtaker has a true score on a
test that would be obtained but for the action of measurement error
f. Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
i. Court challenges to various tests and testing programs have sensitized test developers and users to the
societal demand for fair tests used in a fair manner
1. Publishers strive to develop instruments that are fair when used in strict accordance with guidelines in
the test manual
ii. Fairness related problems/questions:
1. Culture is different from people whom the test was intended for -- Politics
g. Assumption 7: Testing and Assessment Benefit Society
i. Many critical decisions are based on testing and assessment procedures

II. WHAT’S A “GOOD TEST”?


a. Criteria
i. Clear instruction for administration, scoring, and interpretation
b. Reliability
i. A “good test”/measuring tool --- reliable
1. Involves consistency: the precision with which the test measures and the extent to which error is
present in measurements
2. Unreliable measurement needs to be avoided
c. Validity
i. A test is considered valid if it does indeed measure what it purports to measure
ii. If there is controversy over the definition of a construct, then the validity of the test is sure to be criticized as well
iii. Questions regarding validity focus on the items that collectively make up the test
1. Adequately sample range of areas to measure construct
2. Individual items contribute to or take away from test’s validity
iv. Validity may also be questioned on grounds related to the interpretation of test results
d. Other Considerations
i. “Good test”--- one that trained examiners can administer, score and interpret with minimum difficulty
1. Useful
2. Yields actionable results that will ultimately benefit individual testtakers or society at large
ii. Purpose of test --- compare performance of testtaker with performance of other testtakers (contains
adequate norms: normative data)
1. Normative data provide a standard with which the results measured can be compared
III. NORMS
a. Norm-referenced testing and assessment: method of evaluation and a way of deriving meaning from test scores
by evaluating an individual testtaker’s score and comparing it to scores of a group of testtakers
b. Meaning of an individual score is relative to other scores on the same test
c. Norms (scholarly context): usual, average, normal, standard, expected or typical
d. Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for
use as a reference when evaluating or interpreting individual test scores
e. Normative sample: group of people whose performance on a particular test is analyzed for reference in
evaluating the performance of individual testtakers
i. Yields a distribution of scores
f. Norming: refers to the process of deriving norms; particular type of norm derivation
g. Race norming: controversial practice of norming on the basis of race or ethnic background

h. Norming a test can be very expensive---user norms/program norms: consist of descriptive statistics based on a
group of testtakers in a given period of time rather than norms obtained by formal sampling methods
i. Sampling to Develop Norms
j. Standardization: process of administering a test to a representative sample of testtakers for the purpose of
establishing norms
i. Standardized when has clear, specified procedures
k. Sampling
i. Developer targets defined group as population test designed for
1. All have at least one common, observable characteristic
ii. To obtain distribution of scores:
1. Test administered to everyone in targeted population
2. Administer test to a sample of the population
a. Sample: portion of universe of people deemed to be representative of whole population
b. Sampling: process of selecting the portion of universe deemed to be representative of whole
iii. Subgroups within a defined population may differ with respect to some characteristics and it is sometimes
essential to have these differences proportionately represented in sample
1. Stratified sampling: sample reflects statistics of whole population; helps prevent sampling bias and
ultimately aid in interpretation of findings
2. Purposive sampling: arbitrarily select sample we believe to be representative of population
3. Incidental/convenience sampling: sample that is convenient or available for use
a. Very exclusive (contain exclusionary criteria)
IV. TYPES OF STANDARD ERROR
a. Standard error of measurement: estimate the extent to which an observed score deviates from a true score
b. Standard error of estimate – In regression, an estimate of the degree of error involved in predicting the value of
one variable from another
c. Standard error of the mean – a measure of sampling error
d. Standard error of the difference – estimate how large a difference between two scores should be before the
difference is considered statistically significant
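
A small sketch of two of these quantities under assumed numbers: the standard error of the mean (sample SD divided by the square root of n) and the classical standard error of measurement (SD times the square root of one minus the reliability); the reliability coefficient here is invented.

import math
import statistics

scores = [12, 15, 14, 10, 18, 16, 13, 17]

sd = statistics.stdev(scores)                        # sample standard deviation
sem_of_mean = sd / math.sqrt(len(scores))            # standard error of the mean (sampling error)

reliability = 0.90                                   # assumed reliability of the test
sem = sd * math.sqrt(1 - reliability)                # standard error of measurement

# Standard error of the difference between two scores from the same test (classical form):
se_difference = math.sqrt(2) * sem

print(sem_of_mean, sem, se_difference)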
e. Developing norms for a standardized test
i. Establish a standard set of instructions and conditions under which the test is given--- makes scores of
normative sample more comparable with scores of future testtakers
ii. All data collected and analyzed, test developer will summarize data using descriptive statistics (measures of
central tendency and variability)
1. Test developer needs to provide precise description of standardization sample itself
2. Descriptions of normative samples vary widely in detail
f. Tracking
i. Comparisons are usually with people of the same age
ii. Children at the same age level tend to go through different growth patterns
iii. Pediatricians must know the child’s percentile within a given age group
iv. This tendency to stay at about the same level relative to one’s peers is known as tracking (ie height and
weight)
v. Diets may alter this “track”
vi. Faults: some believe there is an analogy between the rates of physical growth and the rates of intellectual
growth
1. Some say that children learn at different rates
2. This system discriminates against some children
V. TYPES OF NORMS
a. Classification of norms-- ex: age, grade, national, local, percentile, etc.
b. PERCENTILES
i. Median = 2nd quartile: the point at or below which 50% of the scores fell and above which the remaining
50% fell
ii. Might wish to divide distribution of scores into deciles (instead of quartiles): 10 equal parts
iii. The Xth percentile is equal to the score at or below which X% of scores fall
iv. Percentile: an expression of the percentage of people whose score on a test or measure falls below a
particular raw score
1. Percentage correct: refers to the distribution of raw scores (number of items that were answered
correctly) multiplied by 100 and divided by the total number of items *not same as percentile
2. Percentile is a converted score that refers to a percentage of testtakers

v. Percentiles are easily calculated -- popular way of organizing test-related data
vi. Using percentiles with normal distribution-- real differences between raw scores may be minimized near
the ends of the distribution and exaggerated in the middle (worsens with highly skewed data)
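
A brief sketch, with invented scores, contrasting percentage correct with a percentile rank, using the "at or below" convention stated above.

scores = [35, 42, 47, 50, 52, 55, 58, 60, 63, 68]   # raw scores for a group of testtakers

def percentage_correct(items_correct, total_items):
    """Number of items answered correctly, divided by the total number of items, times 100."""
    return items_correct * 100 / total_items

def percentile_rank(raw_score, group_scores):
    """Percentage of testtakers in the group scoring at or below the given raw score."""
    at_or_below = sum(1 for s in group_scores if s <= raw_score)
    return at_or_below * 100 / len(group_scores)

print(percentage_correct(52, 80))   # e.g., 52 of 80 items correct = 65.0 percent correct
print(percentile_rank(52, scores))  # the same raw score of 52 sits at the 50th percentile of this group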
c. AGE NORMS
i. Age-equivalent scores/age norms: indicate the average performance of different samples of testtakers who
were at various ages at the time the test was administered
1. Age norm tables for physical characteristics
2. “Mental” age vs. physical age (need to identify mental age)
d. GRADE NORMS
i. Grade norms: designed to indicate the average test performance of testtakers in a given school grade
1. Developed by administering the test to representative samples of children over a range of consecutive
grades
2. Mean or median score for children at each grade level is calculated
3. Great intuitive appeal
4. Do not provide info as to the content or type of items that a student could or could not answer
correctly
ii. Developmental norms: (ex: grade norms and age norms) term applied broadly to norms developed on the
basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise
be affected by chronological age, school grade, or stage of life
e. NATIONAL NORMS
i. National norms: derived from a normative sample that was nationally representative of the population at
the time the norming study was conducted
f. NATIONAL ANCHOR NORMS
i. Many different tests purporting to measure the same human characteristics or abilities
ii. National anchor norms: equivalency tables for scores on tests that purport to measure the same thing
1. Could provide the tool for comparisons
2. Provides stability to test scores by anchoring them to other test scores
3. Begins with the computation of percentile norms for each test to be compared
4. Equipercentile method: equivalency of scores on different tests is calculated with reference to
corresponding percentile scores
g. SUBGROUP NORMS
i. Normative sample can be segmented by any criteria initially used in selecting subjects for the sample
ii. Subgroup norms: result of segmentation; more narrowly defined
h. LOCAL NORMS
i. Local norms: provide normative info with respect to the local population’s performance on some test
ii. Typically developed by test users themselves
i. Fixed Reference Group Scoring Systems
1. Norms provide context for interpreting meaning of a test score
2. Fixed reference group scoring system: distribution of scores obtained on the test from one group of
testtakers (the fixed reference group) is used as the basis for the calculation of test scores for future
administrations of the test
a. Ex: SAT test (developed in 1962)
j. NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
i. Way to derive meaning from test score is to evaluate test score in relation to other scores on same test
(Norm-referenced)
ii. Criterion-referenced: derive meaning from a test score by evaluating it on the basis of whether or not some
criterion has been met
1. Criterion: a standard on which a judgment or decision may be based
2. Criterion-referenced testing and assessment: method of evaluation and way of deriving meaning from
test scores by evaluating an individual’s score with reference to a set standard (ex: to drive, one must pass
a driving test)
a. Derives from values and standards of an individual or organization
b. Also called Domain/content-referenced testing and assessment
c. Critique: if followed strictly, important info about individual’s performance relative to others can
be potentially lost
k. Culture and Inference
i. Culture is a factor in test administration, scoring and interpretation
ii. Test user should do research in advance on test’s available norms to check how appropriate it is for
targeted testtaker population

1. Helpful to know about the culture of the testtaker
VI. CORRELATION AND INFERENCE
a. CORRELATION
i. Degree and direction of correspondence between two things.
ii. Correlation coefficient (r) – expresses a linear relationship between two continuous variables
1. Numerical index that tells us the extent to which X and Y are “co-related”
iii. Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond
to low scores on X
iv. Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa
v. No correlation: the variables are not related
vi. Correlation coefficients range from -1 to +1
vii. Correlation does not imply causation.
1. Ie weight, height, intelligence
b. PEARSON r
i. Pearson Product Moment Correlation Coefficient
ii. Devised by Karl Pearson
iii. Relationship of two variables are linear and continuous
iv. Coefficient of Determination (r2) – indication of how much variance is shared by the X and the Y variables
c. SPEARMAN RHO
i. Rank order correlation coefficient
ii. Developed by Charles Spearman
iii. Used when the sample size is small and when both sets of measurements are in ordinal form (ranking form)
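
A short sketch, with invented paired scores, of Pearson r, the coefficient of determination, and Spearman rho; scipy.stats is an assumed available library.

from scipy import stats

x = [2, 4, 5, 6, 8, 9, 11]        # scores on variable X
y = [1, 3, 5, 5, 7, 8, 12]        # scores on variable Y

r, _ = stats.pearsonr(x, y)       # Pearson product-moment correlation coefficient
r_squared = r ** 2                # coefficient of determination: variance shared by X and Y
rho, _ = stats.spearmanr(x, y)    # Spearman rank-order correlation coefficient

print(r, r_squared, rho)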
d. BISERIAL CORRELATION
i. expresses the relationship between a continuous variable and an artificial dichotomous variable
1. If the dichotomous variable had been true then we would use the point biserial correlation
2. When both variables are dichotomous and at least one of the dichotomies is true, then the
association between them can be estimated using the phi coefficient
3. If both dichotomous variables are artificial, we might use a special correlation coefficient – tetrachoric
correlation
e. REGRESSION
i. analysis of relationships among variables for the purpose of understanding how one variable may predict
another
ii. SIMPLE REGRESSION: one IV (X) and one DV (Y)-
iii. Regression line: defined as the best-fitting straight line through a set of points in a scatter diagram
1. Found by using the principle of least squares, which minimizes the squared deviations around the
regression line
iv. Primary use: To predict one score or variable from another
v. Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the
prediction and the smaller the SEE.
vi. MULTIPLE REGRESSION: The use of more than one score to predict Y.
1. Regression coefficient (b): the slope of the regression line
a. the ratio of the sum of squares for the covariance to the sum of squares for X
b. Sum of squares is defined as the sum of the squared deviations around the mean
c. Covariance is used to express how much two measures covary, or vary together
2. Slope describes how much change is expected in Y each time X increases by one unit
3. Intercept (a) is the value of Y when X is 0
a. The point at which the regression line crosses the Y axis
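A minimal least-squares sketch in Python (arrays invented) showing the slope b, the intercept a, and the standard error of estimate described above:
```python
# Sketch: fitting the best-fitting (least-squares) line Y' = a + bX; data invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                # intercept
y_pred = a + b * x                                                         # Y'
residuals = y - y_pred                                  # residuals (Y - Y') sum to ~0
see = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))    # standard error of estimate
print(f"Y' = {a:.2f} + {b:.2f}X   SEE = {see:.3f}")
```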
f. THE BEST-FITTING LINE
i. The difference between the observed and predicted score (Y-Y’) is called the residual
ii. The best-fitting line is most appropriately found by squaring each residual
iii. Best-fitting line is obtained by keeping these squared residuals as small as possible
1. Principle of least squares:
iv. Correlation is a special case of regression in which the scores for both variables are in standardized, or Z,
units
v. In correlation, the intercept is always 0
vi. Pearson product moment correlation coefficient is a ratio used to determine the degree of variation in one
variable that can be estimated from knowledge about variation in the other variable
vii. Testing the Statistical Significance of a Correlation Coefficient
1. Begin with the null hypothesis that there is no relationship between variables
2. The null hypothesis is rejected if there is evidence that the association between two variables is significantly different from 0
3. t distribution is not a single distribution, but a family of distributions, each with its own degrees of
freedom
4. Degrees of freedom are defined as the sample size minus 2, or N-2
5. Two-tailed test
viii. How to Interpret a Regression Plot
1. Regression plots are pictures that show the relationship between variables
2. Common use of correlation is to determine the criterion validity evidence for a test, or the
relationship between a test score and some well-defined criterion
3. Without a predictor, the best guess for the criterion is the middle level (e.g., of enjoyableness) because it is the one observed most frequently – a normative prediction, because it uses info gained from representative groups
4. Using the test as a predictor is not as good as perfect prediction, but it is still better than using the
normative info
5. A regression line such as that in Figure 3.9 shows that the test score tells us nothing about the criterion beyond the normative info
g. TERMS AND ISSUES IN THE USE OF CORRELATION
i. Residual
1. Difference between the predicted and the observed values is called the residual
a. Y-Y’
2. Important property of residual is that the sum of the residuals always equals 0
3. Sum of the squared residuals is the smallest value according to the principle of least squares
4. Standard Error of Estimate
a. Standard deviation of the residuals is the standard error of estimate
b. A measure of the accuracy of prediction
c. Prediction is most accurate when the standard error of estimate is relatively small
ii. Coefficient of Determination
1. Correlation coefficient squared is known as the coefficient of determination
2. Tells us the proportion of the total variation in scores on Y that we know as a function of information
about X
iii. Coefficient of Alienation
1. Coefficient of alienation is a measure of non-association between two variables
2. Equal to the square root of 1 − r², where r² is the coefficient of determination
3. High value means there is a high degree of non-association between 2 variables
iv. Shrinkage
1. Tendency to overestimate the relationship, particularly if the sample of subjects is small
2. Shrinkage is the amount of decrease observed when a regression equation is created for one
population and then applied to another
v. Cross Validation
1. Use regression equation to predict performance in a group of subjects other than the ones to which
the equation was applied
2. The standard error of estimate is obtained for the relationship between the values predicted by the equation and the values actually observed – this procedure is called cross validation
vi. The Correlation-Causation Problem
1. Experiments are required to determine whether manipulation of one variable causes changes in
another variable
2. A correlation alone does not prove causality, although it might lead to other research that is designed
to establish the causal relationships between variables
vii. Third Variable Explanation
1. A third variable, e.g., poor social adjustment, causes both TV viewing and aggression
2. External influence is the third variable
viii. Restricted Range
1. Correlation and regression use variability on one variable to explain variability on a second variable
2. Restricted range problem: correlation requires variability; if the variability is restricted, then significant
correlations are difficult to find
ix. Multivariate Analysis

Alyssa Louise C. Cabije


BS Psychology
University of Mindanao
1. Multivariate analysis considers the relationship among combinations of three or more variables
x. General Approach
a. Linear combination of variables is a weighted composite of the original variables
b. Y′ = a + b1X1 + b2X2 + … + bkXk
CHAPTER 5:
RELIABILITY
I. RELIABILITY
a. Dependability and consistency
b. Error implies that there will always be some inaccuracy in our measurements
c. Tests that are relatively free of measurement error are deemed to be reliable
d. Reliability estimates in the range of .70 and .80 are good enough for most purposes in basic research
e. Reliability coefficient: an index that indicates the ratio between the true score variance on a test and the total
variance
II. HISTORY OF RELIABILITY
a. Charles Spearman (1904): The Proof and Measurement of Association between Two Things
b. Then Thorndike
c. Item response theory has taken advantage of computer technology to advance psychological measurement
significantly
d. Based on Spearman’s ideas
e. X = T + E --- CLASSICAL TEST THEORY
i. assumes that each person has a true score that would be obtained if there were no errors in measurement
ii. Difference between the true score and the observed score results from measurement error
iii. Assumption here is that errors of measurement are random
iv. Basic sampling theory tells us that the distribution of random errors is bell-shaped
1. The center of the distribution should represent the true score, and the dispersion around the mean of
the distribution should display the distribution of sampling errors
v. Classical test theory assumes that the true score for an individual will not change with repeated applications
of the same test
vi. Variance: standard deviation squared. It is useful because it can be broken into components:
vii. True variance: variance from true differences --- are assumed to be stable
viii. Error variance: random irrelevant sources
ix. Standard error of measurement: we assume that the distribution of random errors will be the same for all
people, classical test theory uses the standard deviation of errors as the basic measure of error
1. Standard error of measurement tells us, on the average, how much a score varies from the true score
2. Standard deviation of the observed score and the reliability of the test are used to estimate the
standard error of measurement
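A one-line sketch of the standard error of measurement implied above, SEM = observed SD × √(1 − reliability); the numbers are hypothetical:
```python
# Sketch: standard error of measurement under classical test theory; values hypothetical.
sd_observed = 15.0                       # standard deviation of observed scores
reliability = 0.91                       # reliability coefficient of the test
sem = sd_observed * (1 - reliability) ** 0.5
print(f"SEM = {sem:.2f}")                # roughly 4.5; score +/- 1 SEM covers ~68% of random error
```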
x. Reliability: proportion of the total variance attributed to true variance.
1. the greater portion of total variance attributed to true variance, the more reliable the test
xi. Measurement error: refers to collectively, all of the factors associated with the process of measuring some
variable, other than the variable being measured
1. Random error: a source of error in measuring a targeted variable caused by unpredictable fluctuations
and inconsistencies of other variables in the measurement process
a. this source of error fluctuates from one testing situation to another with no discernible pattern
that would systematically raise or lower scores
2. Systematic Error:
a. A source of error in measuring a variable that is typically constant or proportionate to what is
presumed to be true value of the variable being measured
b. Error is predictable and fixable
c. Does not affect score consistency
III. SOURCES OF ERROR VARIANCE
a. TEST CONSTRUCTION
i. Item sampling or content sampling: refer to variation among items within a test as well as to variation among items between tests
1. The extent to which a test taker’s score is affected by the content sampled on a test and by the way the
content is sampled (that is, the way in which the item is constructed) is a source of error variance
b. TEST ADMINISTRATION
i. may influence the test taker’s attention or motivation
ii. Environment variables, testtaker variables, examiner variables (e.g., level of professionalism)
c. TEST SCORING AND INTERPRETATION
i. Computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated
error variance caused by scorer differences
ii. However, other tools of assessment still require scoring by trained personnel
iii. If subjectivity is involved in scoring, then the scorer can be a source of error variance
iv. Despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally are still confronted with situations where an examinee’s response lies in a gray area
d. TEST-RETEST RELIABILITY
i. Also known as time-sampling reliability
ii. Correlating pairs of scores from the same group on two different administration of the same test
iii. Measure something that is relatively stable over time
iv. Sources of Error variance:
1. Passage of time: the longer the time that passes, the greater the likelihood that reliability coefficient
will be lower.
2. Coefficient of stability: the estimate of test-retest reliability obtained when the interval between testings is greater than six months
v. Consider possibility of carryover effect: occurs when first testing session influences scores from the second
session
vi. If something affects all the test takers equally, then the results are uniformly affected and no net errors
occurs
vii. Practice tests may make this effect happen
viii. Practice can also affect tests of manual dexterity
ix. Time interval between testing sessions must be selected and evaluated carefully
x. Poor test-retest correlations do not always mean that a test is unreliable – they may suggest that the characteristic under study has changed
e. PARALLEL-FORM OR ALTERNATE FORMS RELIABILITY
i. compares two equivalent forms of a test that measure the same attribute
ii. The two forms should be constructed to be equivalent in content, format, difficulty, etc.
iii. When two forms of the test are available, one can compare performance on one form versus the other –
equivalent forms reliability or parallel forms
iv. Coefficient of equivalence: degree of relationship between various forms of a test can be evaluated by
means of an alternate-forms
v. Parallel forms: each form of the test, the means and variances of observed test scores are equal
vi. Alternate forms: different versions of a test that have been constructed so as to be parallel
vii. Disadvantages: (1) two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, etc.
viii. Problem: developing a new version of a test
f. INTERNAL CONSISTENCY
i. How well does each item measure the content/construct under consideration
ii. How consistent the items are with one another
iii. Used when tests are administered once
iv. If all items on a test measure the same construct, then it has a good internal consistency
v. Split-half reliability, KR20, Cronbach Alpha
g. SPLIT-HALF RELIABILITY
i. Correlating two pairs of scores obtained from equivalent halves of a single test administered once.
ii. This is useful when it is impractical to assess reliability with two tests or to administer test twice
iii. Results of one half of the test are then compared with the results of the other
iv. Rules in splitting forms into half:
1. Do not divide test in the middle because it would lower the reliability
2. Different amounts of anxiety and differences in item difficulty shall also be considered
3. Randomly assign items to one or the other half of the test
4. use the odd-even system: where one subscore is obtained for the odd-numbered items in the test and
another for the even-numbered items
v. To correct for half-length, apply the Spearman-Brown formula, which allows you to estimate what the
correlation between the two halves would have been if each half had been the length of the whole test
vi. Use this if test user wish to shorten a test
vii. Used to determine the number of items needed to attain a desired level of reliability
viii. Reliability increases as the test length increases
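A small sketch of the Spearman-Brown formula mentioned above, rSB = n·r / (1 + (n − 1)·r), where n is the factor by which test length changes (values invented):
```python
# Sketch: Spearman-Brown correction; r is the half-test (or current) reliability,
# n is the factor by which the number of items changes (2 = double, 0.5 = halve).
def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

print(round(spearman_brown(0.70, 2), 3))    # split-half r of .70 -> ~.82 full-length estimate
print(round(spearman_brown(0.90, 0.5), 3))  # shortening a .90 test by half -> ~.82
```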
h. KUDER-RICHARDSON FORMULAS OR KR20/KR21
i. Kuder-Richardson technique simultaneously considers all possible ways of splitting the items
ii. The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1, is the
Kuder-Richardson 20 (see p.114)
iii. KR-21 was introduced later – it uses an approximation of the sum of the pq products based on the mean test score
i. CRONBACH ALPHA
i. Cronbach developed a formula that estimates the internal consistency of tests in which the items are not
scored as 0 or 1 – a more general reliability estimate, which he called coefficient alpha
ii. Sum the individual item variances
1. Most general method of finding estimates of reliability through internal consistency
iii. Domain sampling: define a domain that represents a single trait or characteristic, and each item is an
individual sample of this general characteristic
iv. Factor analysis deals with the situation in which a test apparently measures several different characteristics
1. Good for the process of test construction
v. Most widely used as a measure of reliability because it requires only one administration of the test
vi. Ranges from 0 to 1 “bigger is always better”
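A minimal sketch of coefficient alpha computed from a person-by-item score matrix (made-up 0/1 data, so the same computation doubles as KR-20):
```python
# Sketch: coefficient alpha = (k / (k - 1)) * (1 - sum of item variances / total-score variance).
import numpy as np

scores = np.array([      # rows = testtakers, columns = items (invented data)
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])
k = scores.shape[1]
sum_item_var = scores.var(axis=0, ddof=1).sum()     # sum of individual item variances
total_var = scores.sum(axis=1).var(ddof=1)          # variance of the total scores
alpha = (k / (k - 1)) * (1 - sum_item_var / total_var)
print(f"alpha = {alpha:.3f}")
```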
j. Other Methods of Estimating Internal Consistencies
i. Inter-item consistency: refers to the degree of correlation among all the items on a scale
1. A measure of inter-item consistency is calculated from a single administration of a single form of a test
2. An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test
3. Tests are said to be homogenous if they contain items that measure a single trait
4. Definition: the degree to which a test measures a single factor
5. Heterogeneity: degree to which a test measures different factors
6. Ex: homogeneous = a test that assesses knowledge only of television repair skills vs. a general electronics repair test (heterogeneous)
7. The more homogenous a test is, the more inter-item consistency it can be expected to have
8. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation
9. Test takers with the same score on a homogenous test probably have similar abilities in the area tested
10. Test takers with the same score on a heterogeneous test may have quite different abilities
11. However, homogenous testing is often an insufficient tool for measuring multifaceted psychological
variables such as intelligence or personality
k. Measures of Inter-Scorer Reliability
i. In some types of tests under some conditions, the score may be more a function of the scorer than of
anything else
ii. Inter-scorer reliability: is the degree of agreement or consistency between two or more scorers (or judges or
raters) with regard to a particular measure
iii. Coefficient of inter-scorer reliability: coefficient of correlation to determine the degree of consistency among
scorers in the scoring of a test
iv. Kappa statistic is the best method for assessing the level of agreement among several observers
1. Indicates the actual agreement as a proportion of the potential agreement following the correction for
chance agreement
2. Cohen’s Kappa – 2 raters
3. Fleiss’ Kappa – 3 or more raters
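A minimal sketch of Cohen’s kappa for two raters (ratings invented), computed as (observed agreement − chance agreement) / (1 − chance agreement):
```python
# Sketch: Cohen's kappa for two raters; the ratings below are invented.
import numpy as np

rater_1 = np.array([1, 0, 1, 1, 0, 1, 0, 0])
rater_2 = np.array([1, 0, 1, 0, 0, 1, 0, 1])

p_observed = np.mean(rater_1 == rater_2)                 # proportion of agreements
categories = np.union1d(rater_1, rater_2)
p_chance = sum(np.mean(rater_1 == c) * np.mean(rater_2 == c) for c in categories)
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"kappa = {kappa:.3f}")                            # 0.500 for these ratings
```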
IV. HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
a. Homogeneous items have a high degree of reliability
V. DYNAMIC VS. STATIC CHARACTERISTICS
a. Dynamic: trait, state, ability presumed to be ever-changing as a function of situational and cognitive experiences
b. Static: trait, state, ability relatively unchanging
VI. RESTRICTION OR INFLATION OF RANGE
a. If it is restricted, reliability tends to be lower.
b. If it is inflated, reliability tends to be higher.
VII. SPEED TESTS VS. POWER TESTS
a. Speed test: items are uniformly easy (homogeneous), but the time limit is short, so not everyone finishes
b. Power test: a longer time limit, but items are more difficult
VIII. CRITERION-REFERENCED TESTS
a. Provide an indication of where a testtaker stands with respect to some variable or criterion.
b. Tends to contain material that has been mastered in hierarchical fashion.
c. Scores here tend to be interpreted in pass-fail terms.
d. Measure of reliability depends on the variability of the test scores: how different the scores are from one
another.
e. The Domain Sampling Model
i. this model considers the problems created by using a limited number of items to represent a larger and more
complicated construct
ii. Our task in reliability analysis is to estimate how much error we would make by using the score from the
shorter test as an estimate of your true ability
iii. Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test and the
variance of the long-run true score
iv. Reliability can be estimated from the correlation of the observed test score with the true score
f. Item Response Theory
i. Classical test theory requires that exactly the same test items be administered to each person – a limitation
ii. Item response theory (IRT) is newer – computer is used to focus on the range of item difficulty that helps
assess an individual’s ability level
1. More reliable estimate of ability is obtained using a shorter test with fewer items
2. However, building the item pool takes a lot of items and effort
g. Generalizability theory
i. based on the idea that a person’s test scores vary from testing to testing because of variables in the testing
situation
ii. Instead of conceiving of all variability in a person’s scores as error, Cronbach encouraged test developers and
researchers to describe the details of the particular test situation or universe leading to a specific test score
iii. This universe is described in terms of its facets: which include things like the number of items in the test, the
amount of training the test scorers have had, and the purpose of the test administration
iv. According to generalizability theory, given the exact same conditions of all the facets in the universe, the
exact same test score should be obtained
v. Universe score: the test score obtained and its analogous to a true score in the true score model
vi. Cronbach suggested that tests be developed with the aid of a generalizability study followed by a decision
study
vii. Generalizability study: examines how generalizable scores from a particular test are if the test is
administered in different situations – how much of an impact different facets of the universe have on the test score
viii. Ex: is the test score affected by group as opposed to individual administration
ix. Coefficients of generalizability: the influence of particular facts on the test score is represented by this. These
coefficients are similar to reliability coefficients in the true score model
x. Decision study: developers examine the usefulness of test scores in helping the test user make decisions – the
decision study is designed to tell the test user how test scores should be used and how dependable those
scores are as a basis for decisions, depending on the context of their use
h. What to Do About Low Reliability
i. Two common approaches are to increase the length of the test and to throw out items that run down the
reliability
ii. Another procedure is to estimate what the true correlation would have been if the test did not have
measurement error
i. Increase the Number of Items
i. The larger the sample, the more likely that the test will represent the true characteristic
1. This could entail a long and costly process however
ii. The Spearman-Brown prophecy formula estimates how reliability changes as test length increases
j. Factor and Item Analysis
i. Reliability of a test depends on the extent to which all of the items measure one common characteristic
ii. Factor analysis
1. Tests are most reliable if they are uni-dimensional: one factor should account for considerably more of
the variance than any other factor
iii. Or examine the correlation between each item and the total score for the test
1. Called discriminability analysis: when the correlation between the performance on a single item and the
total test score is low, the item is probably measuring something different from the other items on the
test
k. Correction for Attenuation
i. Potential correlations are attenuated, or diminished, by measurement error
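The correction for attenuation is usually written r_true = r_xy / √(r_xx · r_yy); a tiny sketch with hypothetical values:
```python
# Sketch: correction for attenuation -- estimate of the correlation between true scores,
# given the observed correlation r_xy and each measure's reliability (values hypothetical).
def correct_for_attenuation(r_xy, r_xx, r_yy):
    return r_xy / (r_xx * r_yy) ** 0.5

print(round(correct_for_attenuation(r_xy=0.40, r_xx=0.70, r_yy=0.80), 3))  # ~0.535
```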
CHAPTER 6:
VALIDITY
I. The Concept of Validity
a. Validity: as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure
in a particular context
i. Judgment based on evidence about the appropriateness of inferences drawn from test scores
ii. Validity of test must be shown from time to time to account for culture and advancement
b. Inference: a logical result or deduction
c. The validity of tests and test scores is often characterized as “acceptable” or “weak”
d. Validation: process of gathering and evaluating evidence about validity
i. Test user and testtaker both have roles in validation of test
ii. Test users may conduct their own validation studies: may yield insights regarding a particular population of
testtakers as compared to the norming sample (in manual)
iii. Local validation studies: absolutely necessary when test user plans to alter in some way the format,
instructions, language, or content of the test
e. Types of Validity (trinitarian view): not mutually exclusive – all contribute to a unified picture of a test’s validity (critique: the approach is fragmented and incomplete)
i. Content validity: measure of validity based on an evaluation of the subjects, topics, or content covered by
the items in the test
ii. Criterion-related validity: measure of validity obtained by evaluating the relationship of scores obtained on
the test to scores on other tests or measures
iii. Construct validity: measure of validity that is arrived at by executing a comprehensive analysis of: (umbrella
validity-- every other variety of validity falls under it)
1. How scores on test relate to other test scores and measures
2. How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
f. Strategies: ways of approaching the process of test validity
i. Content validation strategies
ii. Criterion-related validation strategies
iii. Construct validation strategies
g. Face Validity
i. relates more to what a test appears to measure to the person being tested than to what the test actually
measures
ii. Judgment concerning how relevant the test items appear to be-- usually from testtaker, not test user
iii. Lack of face validity= lack of confidence in perceived effectiveness of test which decreases testtaker’s
motivation/cooperation *may still be useful
h. Content validity
i. a judgment of how adequately a test samples behavior representative of the universe of behavior that the
test was designed to sample
1. Ideally, test developers have a clear vision of the construct being measured -- clarity reflected in the
content validity of the test
ii. Test blueprint: structure of the evaluation; a plan regarding the types of information to be covered by the
items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
1. Behavior observation is a technique frequently used in test blueprinting
iii. The quantification of content validity
1. Important in employment settings -- tests used to hire and promote
2. One method: method for gauging agreement among raters or judges regarding how essential a
particular item is (C.H. Lawshe)
a. “Is the skill or knowledge measured by this item…
i. Essential
ii. Useful but not essential
iii. Not necessary
b. To the performance of the job?”
c. Content validity ratio
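Lawshe’s ratio for a single item is usually computed as CVR = (n_e − N/2) / (N/2), where n_e is the number of panelists rating the item “essential” and N is the panel size; a small sketch with invented panel counts:
```python
# Sketch: Lawshe's content validity ratio for one item (panel counts invented).
def content_validity_ratio(n_essential, n_total):
    return (n_essential - n_total / 2) / (n_total / 2)

print(content_validity_ratio(n_essential=9, n_total=10))   # 0.8  -> strong agreement
print(content_validity_ratio(n_essential=5, n_total=10))   # 0.0  -> only half say "essential"
```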
iv. Culture and the relativity of content validity
1. Tests thought of as either valid or invalid
2. What constitutes historical fact depends to some extent on who is writing the history
a. Culture relativity
b. Politics (politically correct)
i. Criterion-Related Validity
i. judgment of how adequately a test score can be used to infer an individual’s most probable standing on
some measure of interest (measure of interest being the criterion)
ii. 2 types:
1. Concurrent validity: index of the degree to which a test score is related to some criterion measure
obtained at the same time (concurrently)
2. Predictive validity: index of the degree to which a test score predicts some criterion measure
iii. What Is a Criterion?
1. Criterion: a standard on which a judgment or decision may be based; standard against which a test or
a test score is evaluated (criterion-related validity)
2. Characteristics of criterion
a. Relevancy-- pertinent or applicable to the matter at hand
b. Validity (for the purpose which it is being used)
c. Uncontaminated-- Criterion contamination: term applied to a criterion measure that has been
based, at least in part, on predictor measures
iv. Concurrent Validity
1. Test scores are obtained at about the same time as the criterion measures are obtained--measures of
the relationship between the test scores and the criterion provide evidence of concurrent validity
2. Indicate the extent to which test scores may be used to estimate an individual’s present standing on a
criterion
3. Once validity of inference from test scores is established= faster, less expensive way to offer a
diagnosis or a classification decision
4. Concurrent validity of a test can be explored with respect to another test
a. Prior research must have satisfactorily demonstrated the 1st test’s validity
b. 1st test= validating criterion
v. Predictive validity
1. Test scores may be obtained at one time and the criterion measures obtained at a future time, usually
after some intervening event has taken place
a. Intervening event – training, experience, therapy, medication, etc.
b. Measures of relationship between the test scores and a criterion measure obtained at a future
time provide an indication of the predictive validity of the test (how accurately scores on the test
predict some criterion measure)
2. Ex: SAT test score and freshman GPA
3. Judgments of criterion validity are based on 2 types of statistical evidence:
a. The validity coefficient
i. Validity coefficient: correlation coefficient that provides a measure of the relationship
between test scores and scores on the criterion measure
ii. Ex: Pearson correlation coefficient
iii. used to determine validity between 2 measures (r)
iv. Affected by restriction or inflation of range
v. Is the range of scores employed appropriate to the objective of the correlational analysis
vi. No rules regarding the validity coefficient (how high or low it should/could be for test to
be valid)
vii. Incremental validity
1. More than one predictor
2. Incremental validity: the degree to which an additional predictor explains something
about the criterion measure that is not explained by predictors already in use
b. Expectancy data
i. provides info that can be used in evaluating the criterion-related validity of a test
ii. Score obtained on expectancy test/tables-- likelihood testtaker will score within some
interval of scores on a criterion measure (“passing”, “acceptable”, etc.)
iii. Expectancy table: shows the percentage of people within specified test-score intervals
who subsequently were placed in various categories of the criterion
1. May be created from scatter plot
2. Shows relationships
iv. Expectancy chart: graphic representation of an expectancy table
1. The higher the initial rating, the greater the probability of job/academic success
c. Taylor-Russell tables – provide an estimate of the extent to which inclusion of a particular test
in the selection system will actually improve selection
i. Selection ratio – relationship between the number of people to be hired and the number
of people available to be hired
ii. Base rate – percentage of people hired under the existing system for a particular position who are considered successful
iii. Relationship between predictor and criterion must be linear
d. Naylor-Shine tables – use the difference between the means of the selected and unselected groups to
derive an index of what the test is adding to already established procedures
4. Decision theory and Test utility
a. Base rate – extent to which a particular trait, behavior, characteristic or attribute exists in the
population
b. Hit rate – defined as the proportion of people a test accurately identifies as possessing or
exhibiting a particular trait.
c. Miss rate – proportion of people the test fails to identify as having or not having attributes
i. False positive (type I error) – the test predicts that the testtaker possesses the attribute when he or she actually does not. Ex: scored above the cutoff score and was hired, but failed on the job.
ii. False negative (type II error) – the test predicts that the testtaker does not possess the attribute when he or she actually does. Ex: scored below the cutoff score and was not hired, but could have been successful on the job.
Construct Validity
vi. Construct validity: judgment about the appropriateness of inferences drawn from test scores regarding
individual standings on a variable called a construct
1. Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior
a. Ex: intelligence, depression, motivation, personality, etc.
b. Unobservable, presupposed (underlying) traits that a test developer invokes to describe test
behavior/criterion performance
2. Viewed as unifying concept for all validity evidence
vii. Evidence of Construct Validity
1. Various techniques of construct validation that provide evidence:
a. Test is homogeneous-- measures single construct
b. Test scores increase/decrease as function of age, passage of time, or experimental
manipulation (theoretically predicted)
c. Test scores obtained after some event or passage of time differ from pretest scores
(theoretically predicted)
d. Test scores obtained by people from distinct groups vary (theoretically predicted)
e. Test scores correlate with scores on other tests (theoretically predicted)
2. Evidence of homogeneity
a. Homogeneity: refers to how uniform a test is in measuring a single concept
b. Evidence-- correlations between subtest scores and total test scores
c. Item-analysis procedures have been used in quest for test homogeneity
d. Desirable but not necessary
e. Contributes no info about how construct being measured relates to other constructs
3. Evidence of changes with age
a. If test purports to measure a construct that changes over time then the test scores, too, should
show progressive changes to be considered valid measurement of construct
b. Does not in itself provide info about how construct relates to other constructs
4. Evidence of pretest-posttest changes
a. Can be evidence of construct validity
b. Some of the more typical intervening experiences responsible for changes in test scores are:
i. Formal education
ii. Therapy/medication
iii. Any life experience
5. Evidence from distinct groups/method of contrasted groups
a. Method of contrasted groups: one way of providing evidence for the validity of a test is to
demonstrate that scores on the test vary in a predictable way as a function of membership in
some group
b. Rationale – if a test is a valid measure of a particular construct, then groups of people presumed to differ with respect to that construct should have correspondingly different test scores
6. Convergent evidence
a. Evidence for the construct validity of a particular test may converge from a number of sources,
such as tests or measures designed to assess the same/similar construct
b. Convergent evidence: scores on a test undergoing construct validation correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same/similar construct
7. Discriminant evidence
a. Discriminant evidence: a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated
b. Provides evidence of construct validity
c. Multitrait-multimethod matrix: “two or more traits”, “two or more methods”
d. matrix/table that results from correlating variables (traits) within and between methods
8. Factor analysis
a. Factor analysis: shorthand term for a class of mathematical procedures designed to identify
factors or specific variables that are typically attributes, characteristics, or dimension on which
people may differ
b. Frequently used as a data reduction method in which several sets of scores and correlations
between them are analysed
c. Confirmatory factor analysis: researchers test the degree to which a hypothetical model fits the actual data
i. Factor loading: conveys information about the extent to which the factor determines the
test score or scores
ii. Complex procedures
viii. Validity, Bias, and Fairness
1. Test Bias
a. Bias: a factor inherent in a test that systematically prevents accurate, impartial measurement
b. Technical means to identify and remedy bias (mathematically)
c. Bias implies systematic variation
d. Rating error
i. Rating: a numerical or verbal judgment (or both) that places a person or an attribute along
a continuum identified by a scale of numerical or word descriptions, known as a rating
scale
ii. Rating error: judgment resulting from intentional or unintentional misuse of a rating scale
iii. Leniency error/generosity error: error in rating that arises from the tendency on the part
of the rater to be lenient in scoring, marking, and/or grading
iv. Severity error: error in rating that arises from the rater’s tendency to be overly critical or harsh
v. Central tendency error: rater exhibits a general and systematic reluctance to give ratings at either the positive or negative extreme
2. One way to overcome restriction-of-range rating errors is to use rankings: a procedure that requires the rater to
measure individuals against one another instead of against an absolute scale
a. Rater is forced to select 1st, 2nd, 3rd, etc.
3. Halo effect: the fact that, for some raters, some ratees can do no wrong
a. Tendency to give a particular ratee a higher rating than he or she objectively deserves
b. Criterion data may be influenced by the rater’s knowledge of the ratee – race, gender, etc.
ix. Test fairness
a. Issues of fairness tend to be more difficult and involve values
b. Fairness: the extent to which a test is used in an impartial, just, and equitable way
c. Sources of misunderstanding
i. Discrimination
ii. Group not included in standardization sample
iii. Performance differences between identified groups
j. Relationship Between Reliability and Validity
i. A test should not correlate more highly with any other variable than it correlates with itself
ii. A modest correlation between the true scores on two traits may be missed if the test for each of the traits
is not highly reliable
iii. We can have reliability without validity
1. It is impossible to demonstrate that an unreliable test is valid
CHAPTER 7:
UTILITY
Utility: usefulness or practical value of testing to improve efficiency
I. Factors that Affect a Test’s Utility
a. Psychometric Soundness
i. Reliability and validity of a test
ii. Gives us the practical value of both the scores (reliability and validity)
iii. They tell us whether decisions are cost-effective
iv. A valid test is not always a useful test
1. especially if testtakers do not follow test directions
b. Costs
i. Economic and non economic
ii. Ex.) using a less expensive and therefore less stringent application process for airline personnel.
c. Benefits
i. Profits, gains, advantages
ii. Ex.) more stringent hiring policy---more productive employees
iii. Ex.) maintaining successful and academic environment of university
II. Utility Analysis
a. A family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment
b. An illustration
i. What’s the company’s goal?
1. Limit the cost of selection
a. Don’t use FERT
2. Ensure that qualified candidates are not rejected
a. Set a cut score that yields the lowest false negative rate
3. Ensure that all candidates selected will prove to be qualified
a. Set a cut score that yields the lowest false positive rate
4. Ensure, to the extent possible, that qualified candidates will be selected and unqualified candidates
will be rejected
a. False positives are no better or worse than false negatives
b. Highest hit rate and lowest miss rate
c. How Is a Utility Analysis Conducted?
i. The objective dictates what sort of information will be required as well as the specific methods to be used
1. Expectancy Data
a. Expectancy table provides indication of the likelihood that a testtaker will score within some
interval of scores on a criterion measure
b. Used to measure costs vs. benefits
2. Brogden-Cronbach-Gleser formula
a. Utility gain: estimate of the benefit of using a particular test or selection method
b. Most simply, utility gain = benefits − costs
c. Productivity gain: estimated increase in work output
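A common statement of the Brogden-Cronbach-Gleser gain is benefit − cost, with the benefit term built from the number selected, average tenure, the test’s validity, the dollar value of one SD of job performance, and the mean standard score of those selected; the sketch below uses invented parameter names and numbers:
```python
# Sketch of a Brogden-Cronbach-Gleser-style utility estimate (benefits minus costs).
# All parameter names and values are illustrative assumptions, not real figures.
def utility_gain(n_selected, tenure_years, validity, sd_performance_dollars,
                 mean_z_of_selected, n_tested, cost_per_applicant):
    benefit = (n_selected * tenure_years * validity
               * sd_performance_dollars * mean_z_of_selected)
    cost = n_tested * cost_per_applicant          # total cost of testing all applicants
    return benefit - cost

print(utility_gain(n_selected=10, tenure_years=2, validity=0.35,
                   sd_performance_dollars=8000, mean_z_of_selected=1.0,
                   n_tested=60, cost_per_applicant=50))   # 53000.0
```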
d. Some Practical Considerations
i. The Pool of Job Applicants
1. There is rarely a limitless supply of potential employees
2. Dependent on many factors, including economic environment
3. We assume that top scoring individuals will accept the job, but those individuals are more likely to be
the ones being offered higher positions
ii. The complexity of the Job
1. It is questionable whether the same utility analysis methods can be used to measure the eligibility of
varying complexities of jobs
iii. The cut score in use
1. Relative cut score: may be defined as a reference point
a. Based on norm-related considerations rather than on the relationship of test scores to a criterion
b. Also called norm-referenced cut score
c. Ex.) top 10% of test scores get A’s
2. Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required
to be included in a particular classification.
a. Also called absolute cut scores
3. Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of
categorizing testtakers
a. Ex.) having cut score that marks an A, B, C etc. all measuring same predictor
4. Multiple hurdles: for success, requires an individual to complete a series of tasks, with elimination at
each level
a. Ex.) written application -- group interview -- personal interview etc.
5. Compensatory model of selection: assumption is made that high scores on one attribute can
compensate for low scores on another attribute
iv. Methods for Setting Cut Scores
1. The Angoff Method: Judgments of experts are averaged
2. The Known Groups Method:
a. Collection of data on the predictor of interest from groups known to possess, and not to possess, the trait, attribute, or ability
b. Cut score is based on the test score that best discriminates the two groups’ performance
3. IRT-Based Method
a. Based on testtaker’s performance across all items on a test
b. Some portion of test items must be correct
4. Item-mapping method: determining difficulty level reflected by cut score
5. Bookmark method
a. test items are listed, one per page, in ascending level of difficulty.
b. An expert places a bookmark to mark the divide which separates testtakers who have acquired
minimal knowledge, skills, or abilities and those that have not.
c. Problems include training of experts, possible floor and ceiling effects, and the optimal length of
item booklets
6. Other Methods
a. discriminant analysis: family of statistical techniques used to shed light on the relationship
between certain variables and two or more naturally occurring groups
b. ex.) the relationships between scores of tests and people judged to be successful or unsuccessful
at job
CHAPTER 8:
TEST DEVELOPMENT
I. STEPS:
a. TEST CONCEPTUALIZATION
i. Begins with a thought or stimulus, which could be almost anything
ii. An emerging social phenomenon or pattern of behavior might serve as the stimulus for the development of a
new test.
iii. Norm referenced: An item for which high scorers on the test respond correctly. Low scorers respond to that
same item incorrectly
iv. Criterion referenced: a good item is one that testtakers who have mastered the material or criterion get right, whereas those who have not get that same item wrong
v. Pilot work: pilot study or pilot research. To know whether some items should be included in the final form of
the instrument.
1. the test developer typically attempts to determine how best to measure a targeted construct
b. TEST CONSTRUCTION
i. Scaling: process of setting rules for assigning numbers in measurement.
ii. L.L. Thurstone: credited with being at the forefront of efforts to develop methodologically sound scaling methods
iii. TYPES OF SCALES:
1. Nominal, ordinal, interval or ratio
2. Age-based scale
3. Grade-based scale
4. Stanine scale (raw score converted to 1-9)
5. Unidimensional vs. multidimensional
a. Unidimensional: measuring one construct
b. Multidimensional: measuring more than one construct
6. Comparative vs. categorical
a. Comparative scaling: entails judgments of a stimulus in comparison with every other stimulus on
the scale
b. Categorical scaling: stimuli are placed into one of two or more alternative categories that differ
quantitatively with respect to some continuum
7. Rating Scale: Which can be defined as a grouping of words, statements, or symbols on which
judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
8. Summative scale: when final score is obtained by summing the ratings across all the items
9. Likert scale: each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum
10. Method of paired comparisons: presented with two stimuli and asked to compare
11. Comparative scaling: judging of a stimulus in comparison with every other stimulus on the scale
12. Categorical scaling: testtaker places stimuli into a category; those categories differ quantitatively on a
spectrum.
13. Guttman scale (Scalogram analysis):
a. items range from sequentially weaker to stronger expressions of attitude, belief, or feeling.
b. A testtaker who agrees with the stronger statement is assumed to also agree with the milder
statements
14. Equal-appearing intervals (Thurstone): a direct estimation method, because there is no need to transform the testtaker’s responses to another scale
iv. WRITING ITEMS
1. 3 Questions of test developer:
a. What range of content should the items cover?
b. Which of the many different types of item formats should be employed?
c. How many items should be written in total and for each content area covered?
2. Item pool: reservoir from which items will or will not be drawn for the final version of the test (should contain about double the number of items the final version will have)
3. Item format: variables such as the form, plan, structure, arrangement and layout of individual test
items
a. 2 types
i. selected-response format:
1. testtaker selects a response from a set of alternative responses
2. includes multiple choice, true-false, and matching
ii. constructed-response format:
1. testtaker supplies or creates the correct answer
2. includes completion item, short answer and essay
4. Writing Items for computer administration
a. Item bank: relatively large and easily accessible collection of test questions
b. Computerized Adaptive Testing (CAT): interactive, computer-administered testtaking process
wherein items presented to the testtaker are based in part on testtaker’s performance on
previous items.
c. Floor effect: the diminished utility of an assessment tool for distinguishing testtakers at the low
end of the ability, trait, or other attribute being measured
d. Ceiling effect: diminished utility of an assessment tool for distinguishing testtakers at the high
end of the ability, trait, attribute being measured
e. Item branching: ability of computer to tailor the content and order of presentation of test items
on the basis of responses to previous items
v. SCORING ITEMS
1. Cumulative scoring: testtakers earn cumulative credit with regard to a particular construct
2. Class/category scoring: testtaker responses earn credit toward placement in a particular class or
category with other testtakers whose pattern of responses is presumably similar in some way
3. Ipsative scoring: comparing a testtaker’s score on one scale within a test to another scale within that same test
a. ex.) “John’s need for achievement is higher than his need for affiliation”
vi. ITEM WRITING
1. Personality and intelligence tests require different sorts of responses
2. Guidelines for item writing
a. Define clearly what you want to measure
b. Generate an item pool
c. Avoid exceptionally long items
d. Keep the level of reading difficulty appropriate for those who will complete the scale
e. Avoid “double-barreled” items that convey two or more ideas at the same time
f. Consider mixing positively and negatively worded items
3. Must be sensitive to ethnic and cultural differences
4. Items that retain their reliability are more likely to focus on skills, while those that lose reliability tend to focus on more abstract concepts
5. Item Formats
a. Simplest test uses dichotomous format
b. The Dichotomous Format
i. Dichotomous format offers two alternatives for each item
1. Ie. True-false examination
ii. Advantages:
1. Simplicity
2. True-false items require absolute judgment
iii. Disadvantages:
1. True-false encourage students to memorize material
2. “truth” often comes in shades of gray
3. Mere chance of getting any item correct is 50%
4. The yes-no format on personality tests is also an example of the dichotomous format
iv. Multiple-choice = polytomous
c. The Polytomous Format
i. Polytomous format resembles the dichotomous format except that each item has more
than two alternatives
1. Multiple-choice exams
ii. Advantages:
1. Little time is needed for test takers to respond to a particular item because they do not have to
write
iii. Incorrect choices are called distractors
iv. Disadvantages:
1. How many distractors should a test have? --> 3 or 4
2. Distractors hurting reliability / validity of test
3. Three alternative multiple-choice items may be better than five alternative items
because they retain the psychometric value but take less time to develop and
administer
4. Scoring of MC exams must take guessing into account, since guessing alone will yield some correct answers
5. With a correction for guessing, however, the expected score from random guessing is 0 – wrong answers subtract from the score
v. Guessing can be good if you can narrow down a couple answers
vi. Students are more likely to guess when they anticipate a lower grade on a test than when
they are more confident
vii. Guessing threshold describes the chances that a low-ability test taker will obtain each score
viii. True-false and MC tests are common to educational and achievement tests
ix. Likert format, category scale, and the Q-sort used for personality-attitude tests
d. Likert Format
i. Requires that a respondent indicate the degree of agreement with a particular attitudinal
question
1. Strongly disagree ... Strongly agree
2. For measurements of attitude
ii. Used to create Likert Scales: scales require assessment of item discriminability
iii. Familiar and easy --- likely to remain popular in personality and attitude tests
e. Category Format
i. uses more choices than Likert; 10-point rating scale
ii. Disadvantage: responses to items on 10-pt scales are affected by the groupings of the
people or things being rated
iii. People change their ratings depending on context
1. This problem can be avoided if the endpoints of the scale are clearly defined and the
subjects are frequently reminded of the definitions of the endpoints
iv. Optimal number of points is 7?
1. Number depends on the fineness of the discrimination that subjects are willing to
make
2. When people are highly involved with some issue, they will tend to respond best to a
greater number of categories
v. Increasing the number of response categories may not increase reliability and validity
vi. Visual analogue scale: respondent is given a 100-millimeter line and asked to place a mark
between two well-defined endpoints
1. Often used to measure self-rated health
f. Checklists and Q-Sorts
i. Adjective Checklist: subject receives a long list of adjectives and indicates whether each one
is characteristic of himself or herself
1. Requires subjects either to endorse such adjectives or not, thus allowing only two
choices for each item
ii. Q-Sort: increases the number of categories
1. Used to describe oneself or to provide ratings of others
g. Other Possibilities
i. Forced-choice and Likert formats are clearly the most popular in contemporary tests and
measures
ii. Checklists have fallen out of favor because they are more prone to error than are formats
that require responses to every item
iii. Frequent advice is to not use “all of the above” as a response option
6. TEST TRYOUT
a. What is a good item?
i. Reliable and valid
ii. Helps to discriminate testtakers
7. ITEM ANALYSIS
a. The Item-Difficulty Index
i. Obtained by calculating the proportion of the total number of testtakers who answered the
item correctly “p”
ii. Higher p= easier item
iii. Difficulty can be replaced with endorsement in non-achievement tests
iv. The midpoint representing the optimal difficulty is obtained by summing up the chance of
success proportion and 1.00 and then dividing the sum by 2
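A small sketch of the item-difficulty index p and the optimal-difficulty midpoint described above (responses invented; the chance level assumes a four-option multiple-choice item):
```python
# Sketch: item-difficulty index p and optimal difficulty for a guessable item.
import numpy as np

item_responses = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])   # 1 = correct (invented)
p = item_responses.mean()                    # proportion answering correctly; higher p = easier
print("p =", p)                              # 0.7

chance = 0.25                                # chance success on a 4-option multiple-choice item
optimal_p = (chance + 1.00) / 2              # midpoint between chance and 1.00
print("optimal difficulty =", optimal_p)     # 0.625
```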
b. Item Reliability Index
i. Indication of the internal consistency of a test
ii. Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
iii. Factor analysis and inter-item consistency: factor analysis determines whether items on a test appear to be measuring the same thing
c. The Item-Validity Index
i. Statistic designed to provide an indication of the degree to which a test is measuring what it
purports to measure
ii. Requires: item-score standard deviation, the correlation between the item score and
criterion score
iii. The Item-Discrimination Index
iv. Measures how adequately an item separates or discriminates between high scorers and low
scorers
v. “d” compares performance on a particular item with performance in the upper and lower
regions of a distribution of continuous test scores
vi. higher d means greater number of high scorers answering the item correctly
vii. negative d means low-scoring examinees are more likely to answer the item correctly than
high-scoring examinees
viii. Analysis of item alternatives
d. Item-Characteristic Curves
i. Graphic representation of item difficulty and discrimination
e. Other Considerations in Item Analysis
i. Guessing
1. Usually in some direction
2. Depends on individuals ability to take risks
ii. Item fairness
1. Bias
iii. Speed tests
1. Last items will appear to be more difficult because not everyone got to them
f. Qualitative Item Analysis
i. Qualitative methods: techniques of data generation and analysis that rely primarily on
verbal rather than mathematical or statistical procedures
ii. Qualitative item analysis: various non statistical procedures designed to explore how
individual test items work
1. Through means like interviews and group discussions
iii. “Think aloud” test administration
1. approach to cognitive assessment that entails respondents vocalizing thoughts as they
occur
2. used to shed light on the testtaker’s thought processes during the administration of a test
iv. Expert panels
1. Sensitivity review: study of test items in which they are examined for fairness to all
prospective testtakers as well as for the presence of offensive language, stereotypes,
or situations
8. ITEM ANALYSIS (KAPLAN BASED)
a. The Extreme Group Method
i. Compares the proportion passing an item among people who have done well on the test with the proportion passing among those who have done poorly
ii. The difference between these proportions is called the discrimination index
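A minimal sketch of the extreme-group discrimination index (invented scores; the top and bottom thirds stand in for the “well” and “poorly” performing groups):
```python
# Sketch: extreme-group discrimination index for one item; data are invented.
import numpy as np

total_scores = np.array([10, 14, 19, 23, 25, 28, 31, 35, 38, 40])
item_correct = np.array([ 0,  0,  1,  0,  1,  1,  1,  1,  1,  1])   # 1 = item passed

order = np.argsort(total_scores)
n_extreme = len(total_scores) // 3                      # size of each extreme group
low_group, high_group = order[:n_extreme], order[-n_extreme:]
d = item_correct[high_group].mean() - item_correct[low_group].mean()
print(f"discrimination index d = {d:.2f}")              # ~0.67 here
```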
b. The Point Biserial Method
i. Find the correlation between performance on the item and performance on the total test
ii. Correlation between a dichotomous variable and a continuous variable is called a point
biserial correlation
iii. On tests with only a few items, using this is problematic because performance on the item
contributes to the total test score
c. Pictures of Item Characteristics
i. Valuable way to learn about items is to graph their characteristics, which you can do with
the item characteristic curve
ii. Prepare a graph for each individual test item
1. Total test score is used as an estimate of the amount of a ‘trait’ possessed by
individuals
iii. Relationship between performance on the item and performance on the test gives some
info about how well the item is tapping the info we want
d. Drawing the Item Characteristic Curve
i. To draw this, we need to define discrete categories of test performance
ii. If the test has been given to many people, we might choose to make each test score a single
category
iii. Gradual positive slope of the line demonstrates that the proportion of people who pass the
item gradually increases as test scores increase
1. This means that the item successfully discriminates at all levels of test performance
iv. Ranges in which the curve changes suggest that the item is sensitive, while flat ranges
suggest areas of low sensitivity
v. Item analysis can break the general rule that increasing the number of items makes a test more
reliable
vi. When bad items are eliminated, the effects of chance responding can be eliminated and the
test can become more efficient, reliable, and valid
e. Item Response Theory
i. According to classical test theory, a score is derived from the sum of an individual’s
responses to various items, which are sampled from a larger domain that represents a
specific trait or ability
ii. New approaches consider the chances of getting particular items right or wrong – item
response theory – make extensive use of item analysis
1. With this, each item on a test has its own item characteristic curve that describes the
probability of getting each particular item right or wrong given the ability level of each
test taker
2. Testers can make an ability judgment without subjecting the test taker to all of the
test items
3. Technical adv: builds on traditional models of item analysis and can provide info on
item functioning, the value of specific items, and the reliability of a scale
4. Two dimensions used are difficulty and discriminability
5. Most attractive adv. Is that one can easily adapt the IRT tests for computer
administration
a. Computer can rapidly identify the specific items that are required to assess a
particular ability level
6. “Peaked conventional” tests concentrate items around a middle difficulty level, measuring mid-range ability precisely but the extremes less well
7. “rectangular conventional” – requires that test items be selected to create a wide
range in level of difficulty
a. Problem: only a few items of the test are appropriate for individuals at each
ability level; many test takers spend much of their time responding to items
either considerably below their ability level or too difficult to solve
8. IRT addresses traditional problems in test construction well
9. IRT can identify respondents with unusual response patterns and offer insights into
cognitive processes of the test taker
10. May also reduce the biases against the people who are slow in completing test
problems
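A sketch of an item characteristic curve under a two-parameter logistic (2PL) model, one common IRT formulation; a is the item’s discrimination, b its difficulty, and theta the testtaker’s ability (all values invented):
```python
# Sketch: probability of a correct response under a 2PL IRT model (values invented).
import numpy as np

def p_correct(theta, a, b):
    """Item characteristic curve: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(p_correct(theta, a=1.2, b=0.5), 3))
# The curve rises with ability; larger a gives a steeper (more discriminating) curve
# around the item's difficulty b.
```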
iii. External Criteria
1. Item analysis has been persistently plagued by researchers’ continued dependence on
internal criteria, or total test score, for evaluating items
iv. Linking Uncommon Measures
1. One challenge in test applications is how to determine linkages between two different
measures
v. Items for Criterion-Referenced Tests
1. Traditional use of tests requires that we determine how well someone has done on a
test by comparing the person’s performance to that of others
2. Criterion-referenced tests compare performance with some clearly defined criterion
for learning
a. Popular approach in individualized instruction programs
b. Regarded as diagnostic instruments
3. First step in developing these tests involves clearly specifying the objectives by writing
clear and precise statements about what the learning program is attempting to
achieve
4. To evaluate the items: one should give the test to two groups of students – one that
has been exposed to the learning unit and one that has not
5. Plotting the score distributions of the two groups together typically yields a V-shaped (bimodal) frequency polygon; the bottom of the V is the antimode – the least frequent score
6. This point divides those who have been exposed to the unit from those who have not
been exposed and is usually taken as the cutting score or point, or what marks the
point of decision
7. When people get scores higher than the antimode, we assume that they have met the
objective of the test
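A minimal sketch, with simulated scores for an exposed and an unexposed group, of how the antimode might be located and used as the cutting score; the group sizes and score distributions below are hypothetical:

```python
# Minimal sketch of locating the antimode (hypothetical data): pool the scores of
# both groups, build a frequency distribution, and take the least frequent score
# between the two group centers as the cutting score.
import numpy as np

rng = np.random.default_rng(3)
not_exposed = rng.binomial(20, 0.35, size=100)   # scores without the learning unit
exposed = rng.binomial(20, 0.80, size=100)       # scores after the learning unit

scores = np.concatenate([not_exposed, exposed])
counts = np.bincount(scores, minlength=21)       # frequency of each possible score (0-20)

lo, hi = not_exposed.mean(), exposed.mean()      # search between the two group centers
antimode = min(range(int(lo), int(hi) + 1), key=lambda s: counts[s])
print("cutting score (antimode):", antimode)     # scores above this imply the objective was met
```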
vi. Limitations of Item Analysis
1. Main problem: though statistical methods for item analysis tell the test constructor which items do a good job of separating students, they do not help the students learn
2. Although the data are available to give the child feedback on the “bug” in their
thinking, nothing in the testing procedure initiates this guidance
9. TEST REVISION
a. Test Revision in the Life Cycle of an Existing Test
i. Tests get old and need revision
ii. Questions arise over equivalence of two tests
iii. Cross-validation and Co-validation
1. Cross-validation: revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
2. Validity shrinkage: decrease in item validities that inevitably occurs after cross-validation of findings (see the sketch after this list)
3. Co-validation: test validation process conducted on two or more tests using the same
sample of testtakers
4. Co-norming: when co-validation is used in conjunction with the creation of norms or
the revision of existing norms
5. Quality assurance during test revision
a. test givers must have some degree of qualification, training, and testing
b. anchor protocol: test protocol scored by a highly authoritative scorer that is
designed as a model for scoring and a mechanism for resolving scoring
discrepancies
c. scoring drift: a discrepancy between scoring in an anchor protocol and the
scoring of another protocol
b. The Use of IRT (Item Response Theory) in Building and Revising Tests
i. Evaluating the properties of existing tests and guiding test revision
ii. Determining measurement equivalence across testtaker populations
1. Differential item functioning (DIF): phenomenon wherein an item functions differently in one group of testtakers as compared to another group known to have the same level of the underlying trait (see the sketch below)
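A minimal sketch of a DIF screen on simulated data: testtakers are matched on total score and the item's pass rates are compared across two groups. Real analyses would use Mantel-Haenszel statistics or IRT-based methods; the data and item below are hypothetical:

```python
# Minimal sketch of a DIF screen (hypothetical data): for testtakers matched on
# total score, an item that is much harder for one group than the other is flagged.
import numpy as np

def dif_gap(responses, group, item):
    totals = responses.sum(axis=1)
    gaps = []
    for s in np.unique(totals):
        a = responses[(totals == s) & (group == 0), item]
        b = responses[(totals == s) & (group == 1), item]
        if len(a) and len(b):
            gaps.append(a.mean() - b.mean())      # pass-rate gap at this matched score level
    return np.mean(gaps)                          # near 0 -> little evidence of DIF

rng = np.random.default_rng(2)
resp = (rng.random((400, 15)) < 0.6).astype(int)  # simulated 0/1 responses
grp = rng.integers(0, 2, size=400)                # simulated group membership
print("average matched-score gap:", dif_gap(resp, grp, item=3))
```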
iii. Developing item banks
1. Items from other instruments -> item pool -> scrutiny -> preliminary item bank ->
psychometric testing -> item bank
CHAPTER 9:
INTELLIGENCE AND ITS MEASUREMENT
I. What is Intelligence?
a. Intelligence: a multifaceted capacity that manifests itself in different ways across the lifespan. Usually includes
abilities to:
i. Acquire and apply knowledge
ii. Reason logically
iii. Plan effectively
iv. Infer perceptively
v. Make judgments and solve problems
vi. Grasp and visualize concepts
vii. Pay attention
viii. Be intuitive
ix. Find the right words and thoughts with facility
x. Cope with, adjust to, and make the most of new situations
b. Intelligence Defined: Views of the Lay Public
i. Both social and academic
c. Intelligence Defined: Views of Scholars and Test Professionals
i. Francis Galton
1. First to publish on heritability of intelligence
2. Most intelligent persons were those with the best sensory abilities
ii. Alfred Binet
1. Made tests about intelligence, but didn’t define it
2. Components of intelligence: reasoning, judgment, memory, abstraction
3. Added that definition is complex; requires interaction of components
4. He argued that when one solves a particular problem, the abilities used cannot be separated
because they interact to produce the solution.
iii. David Wechsler
1. Best way to measure this global ability was by measuring aspects of several “qualitatively
differentiable” abilities
2. Complexity of intelligence
3. Conceptualization as an “aggregate” or “global” capacity
iv. Jean Piaget
1. Studied children
2. Believed order of maturation to be unchangeable
3. With age, schemata increase; schema: organized action or mental structure that, when applied to the world,
leads to knowing or understanding.
4. Learning occurred through assimilation (actively organizing new information so that it fits in with what already is perceived and thought) and accommodation (changing what is already perceived or thought so that it fits with new information)
5. Sensorimotor (0-2)
6. Preoperational (2-6)
7. Concrete Operational (7-12)
8. Formal Operational (12 and older)
v. All share interactionism: complex concept by which heredity and environment are presumed to interact
and influence the development of one’s intelligence
vi. Factor-analytic theories: focus is squarely on identifying the ability(ies) deemed to constitute intelligence
vii. Information-processing theories: focus is on identifying the specific mental processes that constitute
intelligence.
d. Factor-Analytic Theories of Intelligence
i. Charles Spearman: pioneered new techniques to measure intercorrelations between tests.
1. Existence of a general intellectual ability factor (g) that is tapped by all other mental abilities
2. g representing the portion of the variance that all intelligence tests have in common and the
remaining portions of the variance being accounted for either by specific components (s) or by error
components (e)
3. The greater a test's g loading, the better the test was thought to predict overall intelligence
4. group factors: neither as general as g nor as specific as s
a. ex.) linguistic, mechanical, arithmetical abilities
5. Guilford: multiple-factor models of intelligence
a. Explain mental activities by deemphasizing any reference to g
6. Thurstone: conceived intelligence as being composed of 7 primary abilities.
7. Gardner: developed theory of multiple intelligences
a. Question over whether emotional intelligence exists.
b. Logical-mathematical, bodily-kinesthetic, linguistic, musical, spatial, interpersonal and
intrapersonal
8. Raymond Cattell: fluid vs. crystallized intelligence
a. Crystallized intelligence: acquired skills and knowledge and their retrieval. Retrieval of
information and application of general knowledge
b. Fluid intelligence: nonverbal, relatively culture-free, and independent of specific instruction.
9. Horn: added more to 7 factors
a. Vulnerable abilities: decline with age and tend not to return to preinjury levels following brain damage
b. Maintained abilities: tend not to decline with age and may return to preinjury levels following
brain damage.
10. Carroll:
a. Three-stratum theory of cognitive abilities: the strata are layered like geological strata
b. Hierarchical model: meaning that all of the abilities listed in a stratum are subsumed by or
incorporated in the strata above.
c. Those in the first stratum are narrow abilities
11. CHC model (Cattell-Horn-Carroll)
a. Some overlap some difference
b. Doesn’t use g
c. Has broader abilities than Carroll’s theory
12. McGrew: Integrated the Cattell-Horn and Carroll models
13. McGrew and Flanagan: extended this integration into the McGrew-Flanagan CHC model
a. Features 10 broad stratum abilities
b. 70 narrow-stratum abilities
c. Makes no provision for the general intellectual ability factor (g)
d. It was omitted because it has little practical relevance to cross-battery assessment and
interpretation
e. The Information-Processing View
i. Aleksandr Luria
1. How (not what) information is processed
2. Simultaneous/parallel processing: information is integrated all at once
3. Successive/sequential processing: each bit of information is processed individually, one after another
ii. PASS model: (Planning, Attention, Simultaneous, Successive) model of assessing intelligence
iii. Sternberg: "The essence of intelligence is that it provides a means to govern ourselves so that our thoughts and actions are organized, coherent, and responsive to both our internally driven needs and to the needs of the environment"
II. Measuring Intelligence
a. Types of Tasks Used in Intelligence Test
i. Infants: test sensorimotor, interviews with parents
ii. Older child: verbal and performance abilities
iii. Mental Age: index that refers to chronological age equivalent to one’s test performance
iv. Adults: retention of general information, quantitative reasoning, expressive language and memory, and
social judgment
b. Theory in Intelligence Test Development and Interpretation
i. Wechsler made a dichotomous test (Performance and Verbal) but advocated a multifaceted definition
ii. Thorndike: intelligence = social, concrete, abstract
iii. Putting theories into tests is extremely hard
III. Intelligence: Some Issues:
a. Nature vs. Nurture
i. Currently believed to be a mix of the two
ii. Preformationism: all structures, including intelligence, exist at birth and cannot be improved upon
iii. Led to predeterminism: one's abilities are predetermined by genetic inheritance, and no learning or intervention can enhance them
iv. Interactionist: people inherit certain intellectual potential
1. There's a limit to genetically inherited abilities (i.e., one can never have x-ray vision)
b. The Stability of Intelligence
i. Stable pretty much throughout one’s adult life
ii. Cognitive abilities seem to decline with age
c. The Construct Validity of Tests of Intelligence
i. Having construct validity requires having unified understanding of what intelligence is
ii. Very difficult: Spearman says it's one thing, Guilford says it's many
iii. Thorndike approach is sort of compromise
1. Look for one central factor with three additional factors representing social, concrete, and abstract
intelligences
d. Other Issues
i. Flynn effect: IQ scores seem to rise every year, but not coupled with rise in “true intelligence”
ii. Personality
1. High IQ: Need for achievement, competition, curiosity, confidence, emotional stability etc.
2. Low IQ: passivity, dependence, maladjustment
3. Temperament (used to describe infants)
iii. Gender
1. Men usually outscore women on visual-spatialization tasks and intelligence scores
2. Women tend to outscore in language-skill tasks
3. But differences can be bridged
iv. Family Environment
1. Divorce can have negative effects
2. Begins with “maternal effects” in womb
v. Culture
1. Provides specific models for thinking, acting and feeling
2. Assumed that if cultural factors can be controlled then differences between cultural groups will be
lessened
3. Assumed that culture can be removed by the reliance on exclusively nonverbal tasks
a. Tend not to be very good at predicting success in various academic and business settings
4. Culture loading: the extent to which a test incorporates the vocabulary, concepts, traditions,
knowledge and feelings associated with a particular culture
5. No test can be culture free
6. Culture-fair intelligence test: test/assessment process designed to minimize the influence of culture
with regard to various aspects of evaluation procedure
7. Another approached called for cultural-specific intelligence tests
a. Ex.) BITCH measured streetwiseness
b. Lacked predictive validity and useful, practical information
CHAPTER 10:
TESTS OF INTELLIGENCE
I. The Stanford-Binet Intelligence Scales
a. First to have detailed administration and scoring instructions
b. First American test to test IQ
c. First to use alternate items (an item that can be used in place of another)
d. Lacked minority group representation
e. Ratio IQ=(mental age/chronological age)x100Deviation Ratio/test composite: performance of one individual
compared to the performance of others of the same age. Has mean of 100 and standard deviation of 16
f. Age scale: items grouped by age
g. Point scale: items organized by category
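A minimal sketch of both IQ computations referenced above; the ages, raw scores, and norm-group statistics below are hypothetical:

```python
# Minimal sketch of ratio IQ and deviation IQ (hypothetical numbers).
def ratio_iq(mental_age, chronological_age):
    # Ratio IQ = (mental age / chronological age) x 100
    return 100 * mental_age / chronological_age

def deviation_iq(raw_score, age_group_mean, age_group_sd, mean=100, sd=16):
    # Deviation IQ: where the person stands relative to same-age peers,
    # rescaled to a mean of 100 and (for the Stanford-Binet) an SD of 16
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z

print(ratio_iq(mental_age=10, chronological_age=8))                     # 125.0
print(deviation_iq(raw_score=62, age_group_mean=50, age_group_sd=8))    # 124.0
```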
II. The Stanford-Binet Intelligence Scales: Fifth Edition
a. Measures fluid intelligence, crystallized knowledge, quantitative knowledge, visual-processing, and short-
term (working) memory
b. Utilizes adaptive testing: testing individually tailored to the testtaker to ensure that items are neither too difficult (frustrating) nor too easy (false hope); a minimal sketch of adaptive administration with a ceiling rule appears after this list
c. Examiner establishes rapport with the testtaker, then administers a routing test to direct (route) the examinee to the test items most likely to be at an optimal level of difficulty
d. Teaching items: show testtaker what is expected, how to do it.
i. Can be used for qualitative assessment, but not scoring
e. Subtests for verbal and nonverbal tests share same name, but involve different tasks
f. Floor: lowest level of items on subtest
g. Ceiling: highest-level item of subtest
h. Basal level: base-level criterion that must be met for testing on the subtest to continue
i. Ceiling level is met when testtaker fails certain number of items in a row. Test discontinues here.
j. Scores: raw -> standard -> composite
k. Extra-test behavior: behavioral observation
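A minimal sketch of adaptive administration with a ceiling (discontinue) rule, as referenced above; the item list, starting point, and four-failure rule here are hypothetical simplifications, not the SB5's actual rules:

```python
# Minimal sketch of adaptive administration with a ceiling rule (hypothetical rules).
def administer(items, answers_correctly, start, ceiling_failures=4):
    """items: item ids ordered by difficulty;
    answers_correctly(item) -> bool simulates the examinee's response."""
    consecutive_misses = 0
    administered = []
    for item in items[start:]:                      # routing decided the starting point
        administered.append(item)
        if answers_correctly(item):
            consecutive_misses = 0
        else:
            consecutive_misses += 1
            if consecutive_misses >= ceiling_failures:   # ceiling level reached; stop testing
                break
    return administered

# Example: an examinee who passes everything easier than item 12, routed to start at item 5
items = list(range(20))
print(administer(items, answers_correctly=lambda i: i < 12, start=5))
```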
III. The Wechsler Tests
Commonality between all versions: all yield deviation IQs with a mean of 100 and a standard deviation of 15
a. Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV)
i. Core subtest: administered to obtain a composite score
ii. Supplemental/Optional Subtest: provides additional clinical information or extending the number of
abilities or processes sampled.
iii. Yields four index scores: Verbal Comprehension Index, a Working Memory Index, a Perceptual
Reasoning Index, and a Processing Speed Index
b. The Wechsler Intelligence Scale for Children –Fourth Edition (WISC-IV)
i. Process score: index designed to help understand how testtakers process various kinds of information
ii. WISC-IV compared to the SB5
c. The Wechsler Preschool and Primary Scale of Intelligence-Third Edition (WPPSI-III)
i. New scale for children under 6
ii. First major intelligence test which adequately sampled total population of the United States
iii. Subtests labeled core, supplemental, or optional
d. Wechsler, Binet, and the Short Form
i. Short form: test that has been abbreviated in length to reduce time needed to administer, score and
interpret
ii. used with caution, only for screening
iii. provide only estimates
iv. reducing the number of items usually reduces reliability and thus validity
v. Wechsler Abbreviated Scale of Intelligence
e. The Wechsler Test in Perspective
i. Factor Analysis
1. Exploratory factor analysis: summarizing data when we are not sure how many factors are
present in our data
2. Confirmatory factor analysis: used to test highly specific hypotheses about the factor structure
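A minimal sketch of exploratory factor analysis on simulated subtest scores, assuming scikit-learn is available; the data, the six "subtests," and the single-factor choice are hypothetical:

```python
# Minimal sketch of exploratory factor analysis (simulated subtest scores).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
g = rng.normal(size=(300, 1))                       # one shared "general" factor
subtests = g + 0.7 * rng.normal(size=(300, 6))      # six correlated subtest scores

# The analyst chooses the number of factors after inspecting the data (exploratory),
# rather than testing a prespecified structure (confirmatory).
fa = FactorAnalysis(n_components=1).fit(subtests)
print(np.round(fa.components_, 2))                  # loadings of each subtest on the factor
```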
IV. Other Measures of Intelligence
a. Tests Designed for Individual Administration
i. Kaufman Adolescent and Adult Intelligence Test
ii. Kaufman Brief Intelligence Test
iii. Kaufman Assessment Battery for Children
iv. Away from information processing and towards a distinction between sequential and simultaneous
processing
b. Tests Designed for Group Administration
i. Group Testing in the Military
1. WWI -> need for the government to test intelligence as a means of differentiating the "unfit" from those of "exceptionally superior ability"
2. Army Alpha Test: to army recruits who could read. Included general information questions,
analogies, and scrambled sentences to reassemble
3. Army Beta Test: to foreign or illiterate recruits, included mazes, coding, and picture
completion.
4. After the war, the alpha and beta test were used rampantly, and oftentimes misused
5. Screening tools: instrument or procedure used to identify a particular trait or constellation of traits
6. ASVAB (Armed Services Vocational Aptitude Battery): administered to prospective recruits or to high school students looking for career guidance
a. 5 career areas: clerical, electronics, mechanical, skill-technical, and combat operations
7. Group Testing in Schools
a. Useful in developing a child's profile, but cannot be the sole indicator
b. Groups of 10-15
c. Starting in Kindergarten
d. Also called traditional group testing because more modern forms can utilize computers; computerized, individually tailored forms are more aptly called individualized testing
ii. Measures of Specific Intellectual Abilities
1. Widely used intelligence tests sample only some of the many factors that contribute to intelligence
2. Ex.) Creativity
a. Commonly thought to be composed of originality, fluency, flexibility, and elaboration
b. If the focus is too heavily on whether an answer is correct, the test doesn't allow for creativity
c. Achievement tests require convergent thinking: deductive reasoning process that
entails recall and consideration of facts as well as a series of logical judgments to
narrow down solutions and eventually arrive at one solution
d. Divergent thinking: a reasoning process in which thought is free in many different
directions, making several solutions possible
i. Associated words, uses of rubber band etc.
ii. Test-retest reliability for some of these tests is near unacceptable
CHAPTER 11:
Other Individual Tests of Ability in Education and Special Education
I. Alternative Individual Ability Tests Compared with the Binet and Wechsler Scales
a. None of these are clearly superior from a psychometric standpoint
b. Some less stable, most more limited in their documented validity
c. Compare poorly to Binet and Wechsler on all accounts
d. They don't rely on a verbal response as much as the B and W
e. Just use pointing or Yes/No responses, thus do not depend on the complex integration of visual and motor
functioning
f. Contain a performance scale or subscale
g. Their specificity often limits the range of functions or abilities that they can measure
h. Because they are designed for special populations, some alternatives can be administered totally without the
verbal instruction
II. Specific Individual Ability Tests
a. Earliest individual tests typically designed for specific purposes or populations
b. One of the first – Seguin Form Board Test – in 1800s – produced only a single score
i. Used primarily to evaluate mentally retarded adults and emphasized speed and performance
c. After, the Healy-Fernald Test was developed as an exclusively nonverbal test for adolescent delinquents
d. Knox developed a battery of performance tests for non-English adult immigrants to the US – administered
without language; speed not emphasized
e. These early individual tests designed for specific populations, produced a single score, and had nonverbal
performance scales
f. Could be administered without verbal instructions and used with children as well as adults
III. Infant Scales
a. Where mental retardation or developmental delays are suspected, these tests can supplement observation,
genetic testing, and other medical procedures
IV. Brazelton Neonatal Assessment Scale (BNAS)
a. Individual test for infants between 3 days and 4 weeks of age
b. Purportedly provides an index of a newborn’s competence
c. Favorable reviews; considerable research base
d. Wide use as a research tool and as a diagnostic tool for special purposes
e. Commonly used scale for the assessment of neonates
f. Drawbacks:
i. No norms are available
ii. More research is needed concerning the meaning and implication of scores
iii. Poorly documented predictive and construct validity
iv. Test-retest reliability leaves much to be desired
V. Gesell Developmental Schedules (GDS)
a. Infant intelligence measures
b. Used as a research tool by those interested in assessing infant intellectual development after exposure to
mercury, diagnoses of abnormal brain formation in utero and assessing infants with autism
c. Children of 2.3 months to 6.3 years
d. Obtains normative data concerning various stages in maturation
e. Individual’s developmental quotient (DQ) is determined according to a test score, which is evaluated by
assessing the presence or absence of behavior associated with maturation
f. Provides an intelligence quotient like that of the Binet
i. (developmental age / chronological age) x 100
g. But, falls short of acceptable psychometric standards
h. Standardization sample not representative of the population
i. No reliability or validity
j. Does appear to help uncover subtle deficits in infants
VI. Bayley Scales of Infant and Toddler Development – Third Edition (BSID-III)
a. Base assessments on normative maturational developmental data
b. Designed for infants between 1 and 42 months
c. Assesses development across 5 domains: cognitive, language, motor, socioemotional, and adaptive
d. Motor scale: assumes that later mental functions depend on motor development
e. Excellent standardization
f. Generally positive reviews; strong internal consistency
g. More validity studies needed
h. Widely used in research – children with Down syndrome, pervasive developmental disorders, cerebral palsy,
language impairment, etc
i. Most psychometrically sound test of its kind
j. Predictive validity, though, remains in question
VII. Cattell Infant Intelligence Scale (CIIS)
a. Based on normative developmental data
b. Downward extension of the Stanford-Binet scale for 2- to 30-month-olds
c. Similar to the Gesell scale; rarely used today
d. Sample is primarily based on children of parents from lower and middle classes and therefore does not represent
the general population
e. Unchanged for 60yrs
f. Psychometrically unsatisfactory
VIII. Major Tests for Young Children
a. McCarthy Scales of Children’s Abilities (MSCA)
i. Measure ability in children between 2-8 yrs; present a carefully constructed individual test of human ability
ii. Meager validity
iii. Produces a pattern of scores as well as a variety of composite scores
iv. General cognitive index (GCI): standard score with a mean of 100 and a standard deviation of 16
1. Index reflects how well the child has integrated prior learning experiences and adapted them to the
demands of the scales
v. Relatively good psychometric properties
vi. Reliability coefficients in the low .90s
vii. In research studies
viii. Good validity? Good assessment tool
b. Kaufman Assessment Battery for Children - Second Edition (KABC-II)
i. Individual ability test for children between 3-18 yrs; 18 subtests in 5 global scales called sequential processing, simultaneous processing, learning, planning, and knowledge
ii. Intended for psychological, clinical, minority-group, preschool, and neuropsychological assessment as well as research; makes a sequential-simultaneous distinction
1. Sequential processing refers to a child's ability to solve problems by mentally arranging input in sequential or serial order
2. Simultaneous processing refers to a child’s ability to synthesize info from mental wholes in order to
solve a problem
iii. Nonverbal measure of ability too
iv. Well constructed and psychometrically sound; not much evidence of (good) validity
v. Poorer predictive validity for school achievement – smaller differences between whites and minorities
vi. Test suffers from a non-correspondence between its definition and its measurement of intelligence
IX. General Individual Ability Tests for Handicapped and Special Populations
a. Columbia Mental Maturity Scale– Third Edition (CMMS)
i. Purports to evaluate ability in normal and variously handicapped children from 3-12yrs
ii. Requires neither a verbal response nor fine motor skills
iii. Requires subject to discriminate similarities and differences by indicating which drawing does not belong on
a 6-by-9-inch card containing 3-5 drawings
iv. Multiple choice
v. Standardization sample is impressive
vi. Vulnerable to random error
vii. Reliable instrument that is useful in assessing ability in many people with sensory, physical, or language
handicaps
viii. Good screening device
b. Peabody Picture Vocabulary Test – Fourth Edition (PPVT-IV): for ages 2-90 yrs
i. multiple choice tests that require subject to indicate Yes/No in some manner
ii. Instructions administered aloud (not for the deaf)
iii. Purports to measure hearing or receptive vocabulary, presumably providing a nonverbal estimate of verbal
intelligence
iv. Can be done in 15mins, requires no reading ability
v. Good reliability and validity
vi. Should never be used as a substitute for a Wechsler or Binet IQ
vii. Important component in a test battery or used as a screening device
viii. Easy to administer and useful for variety of groups
ix. BUT: tends to underestimate IQ scores, and problems inherent in the multiple-choice format are a drawback
c. Leiter International Performance Scale – Revised (LIPS-R)
i. Strictly a performance scale; aims at providing a nonverbal alternative to the Stanford-Binet scale for 2-18 yr olds
ii. For research, and clinical settings, where it is still widely utilized to assess the intellectual function of
children with pervasive developmental disorders
iii. Purports to provide a nonverbal measure of general intelligence by sampling a wide variety of functions
from memory to nonverbal reasoning
iv. Can be applied to the deaf and language-disabled
v. Untimed
vi. Good validity
d. Porteus Maze Test (PMT)
i. Popular but poorly standardized nonverbal performance measure of intelligence
ii. Individual ability test
iii. Consists of maze problems (12)
iv. Administered without verbal instruction, thus used for a variety of special populations
v. Needs re-standardization
e. Testing Learning Disabilities
i. Major concept is that a child average in intelligence may fail in school because of a specific deficit or
disability that prevents learning
ii. Federal law entitles every eligible child with a disability to a free appropriate public education and
emphasizes special education and related services designed to meet his or her unique needs and prepare
them for further education, employment, and independent living
iii. To qualify, child must have a disability and educational performance affected by it
iv. Educators today can find other ways to determine when a child needs extra help
v. Process called Response to Intervention (RTI): premise is that early intervening services can prevent
academic failure for many students with learning difficulties
vi. Signs of learning problem:
1. Disorganization
2. Careless effort
3. Forgetfulness
4. Refusal to do schoolwork or homework
5. Slow performance
6. Poor attention
7. Moodiness
f. Illinois Test of Psycholinguistic Abilities (ITPA-3)
i. Assumes that failure to respond correctly to a stimulus can result not only from a defective output system but also from a defective input or information-processing system
ii. Stage 1: info must first be received by the senses before it can be analyzed
iii. Stage 2: info is analyzed or processed
iv. Stage 3: with processed info, the individual must make a response
v. Theorizes that the child may be impaired in one or more specific sensory modalities
vi. 12 subtests that measure an individual's ability to receive visual, auditory, or tactile input independently of processing and output factors; purports to help isolate the specific site of a learning disability
vii. For children 2-10 yrs
viii. Early versions were hard to administer and lacked reliability or validity data
ix. Now, with revisions, the ITPA-3 is a psychometrically sound measure of children's psycholinguistic abilities
g. Woodcock-Johnson III
i. Evaluates learning disabilities
ii. Designed as a broad-range individually administered test to be used in educational settings
iii. Assesses general intellectual ability, specific cognitive abilities, scholastic aptitude, oral language, and
achievement
iv. Based on the CHC three-stratum theory of intelligence
v. Compares a child's score on cognitive ability with their score on achievement – can evaluate possible learning problems
vi. Relatively good psychometric properties
vii. For learning disability tests, three conclusions seem warranted:
1. Test constructors appear to be responding to the same criticisms that led to changes in the Binet
and Wechsler scales and ultimately to the development of the KABC
2. Much more empirical and theoretical research is needed
3. Users of learning disabilities tests should take great pains to understand the weaknesses of these procedures and not over-interpret results
h. Visiographic Tests
i. Require a subject to copy various designs
i. Benton Visual Retention Test – Fifth Edition (BVRT-V)
i. Tests for brain damage are based on the concept of psychological deficit, in which a poor performance on a
specific task is related to or caused by some underlying deficit
ii. Assumes that brain damage easily impairs visual memory ability
iii. For individuals 8 yrs+; consists of geometric designs briefly presented and then removed
iv. Computerized version developed
j. Bender Visual Motor Gestalt Test (BVMGT)
i. Consists of 9 geometric figures that the subject is simply asked to copy
ii. By 9yrs, any child of normal intelligence can copy the figures with only one or two errors
iii. Errors occur for people whose mental age is less than 9, brain damage, nonverbal learning disabilities,
emotional problems
iv. Questionable reliability
k. Memory-for-Designs (MFD) Test
i. Drawing test that involves perceptual-motor coordination
ii. Used for people 8-60yrs
iii. Good split-half reliability
iv. Needs more validity documentation
v. All these tests criticized because of their limitations in reliability and validity documentation
vi. Good as screening devices though
l. Creativity: Torrance Tests of Creative Thinking (TTCT)
i. Measurement of creativity underdeveloped in psychological testing
ii. Creativity: ability to be original, to combine known facts in new ways, or to find new relationships between
known facts
iii. Evaluating this is a possible alternative to IQ testing
iv. Creativity tests in early stages of development
v. Torrance tests separately measure aspects of creative thinking such as fluency, originality, and flexibility
vi. Does not meet the Binet and Wechsler scales in terms of standardization, reliability, or validity; has been suggested as an unbiased indicator of giftedness
vii. Inconsistent tests, but available data reflect the tests’ merit and fine potential
m. Individual Achievement Tests: Wide Range Achievement Test-4 (WRAT-4)
i. Achievement tests measure what the person has actually acquired or done with that potential
ii. Discrepancies between IQ and achievement have traditionally been the main defining feature of a learning
disability
iii. Most achievement tests are group tests
iv. WRAT-4 purportedly permits an estimate of grade-level functioning in word reading, spelling, math
computation, and sentence comprehension
v. Used for children 5 yrs+; easy to administer
vi. Problems:
1. Inaccuracy in evaluating grade-level reading ability
2. Not proven as psychometrically sound
CHAPTER 12:
STANDARDIZED TESTS IN EDUCATION, CIVIL SERVICE, AND THE MILITARY
When justifying the use of group standardized tests, test users often have problems defining what exactly they are
trying to predict, or what the test criterion is
I. Comparison of Group and Individual Ability Tests
a. Individual tests require a single examiner for a single subject
i. Examiner provides instructions
ii. Subject responds, examiner records response
iii. Examiner evaluates response
iv. Examiner takes responsibility for eliciting a maximum performance
v. Scoring requires considerable skill
b. Those who use the results of group tests must assume that the subject was cooperative and motivated
i. Many subjects tested at a time
ii. Subjects record own responses
iii. Subjects not praised for responding
iv. Low scores on group tests often difficult to interpret
v. No safeguards
II. Advantages of Individual Tests
a. Provide info beyond the test score
b. Allow the examiner to observe behavior in a standard setting
c. Allow individualized interpretation of test scores
III. Advantages of Group Tests
a. Are cost-efficient
b. Minimize professional time for administration and scoring
c. Require less examiner skill and training
d. Have more objective and more reliable scoring procedures
e. Have especially broad application
IV. Overview of Group Tests
a. Characteristics of Group Tests
i. Characterized as paper-and-pencil or booklet-and-pencil tests because only materials needed are a printed
booklet of test items, a test manual, scoring key, answer sheet, and pencil
ii. Computerized group testing becoming more popular
iii. Most group tests are multiple choice – some free response
iv. Group tests outnumber individual tests
1. One major difference is whether the test is primarily verbal, nonverbal, or combination
v. Group test scores can be converted to a variety of units
b. Selecting Group Tests
i. Test users need never settle for anything but well-documented and psychometrically sound tests
c. Using Group Tests
i. Reliable and well standardized as the best individual tests
ii. Validity data for some group tests are weak/meager/contradictory
d. Use Results with Caution
i. Never consider scores in isolation or as absolutes
ii. Be careful using tests for prediction
iii. Avoid over-interpreting test scores
e. Be Especially Suspicious of Low Scores
i. Assume that subjects understand purpose of testing, want to succeed, and are equally rested/free of stress
f. Consider Wide Discrepancies a Warning Signal
i. May reflect emotional problems or severe stress
g. When in Doubt, Refer
i. With low scores, discrepancies, etc, refer the subject for individual testing
ii. Get trained professional
h. Group Tests in the Schools: Kindergarten Through 12th Grade
i. Purpose of tests is to measure educational achievement in school children
i. Achievement Tests verses Aptitude Tests
i. Achievement tests attempt to assess what a person has learned following a specific course of instruction
1. Evaluate the product of a course of training
2. Validity is determined primarily by content-related evidence
ii. Aptitude tests attempt to evaluate a student’s potential for learning rather than how much a student has
already learned
1. Evaluate effects of unknown and uncontrolled experiences
2. Validity is judged primarily on its ability to predict future performance
j. Intelligence test measures general ability
i. These three tests are highly interrelated
k. Group Achievement Tests
i. Stanford Achievement Test one of the oldest of the standardized achievement tests widely used in school
system
ii. Well-normed and criterion-referenced, with psychometric documentation
iii. Another one is the Metropolitan Achievement Test, which measures achievement in reading by evaluating
vocab, word recognition, and reading comprehension
iv. Both of these are reliable and normed on big samples
l. Group Tests of Mental Abilities (Intelligence)
i. Kuhlmann-Anderson Test (KAT) – 8th Edition
1. KAT is a group intelligence test with 8 separate levels covering kindergarten through 12th grade
2. Items are primarily nonverbal at lower levels, requiring minimal reading and language ability
3. Suited to young children and those who might be handicapped in following verbal procedures
4. Scores can be expressed in verbal, quantitative, and total scores
5. Scores at other levels can be expressed as percentile bands: like a confidence interval, a percentile band provides the range of percentiles that most likely represents a subject's true score (see the sketch after this list)
6. Good construction, standardization, and other excellent psychometric qualities
7. Good validity and reliability
8. Potential for use and adaptation for non-English-speaking individuals or even countries needs to be
explored
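A minimal sketch of how a percentile band might be computed from an observed score and a standard error of measurement, assuming a normal norm-group distribution; all numbers below are hypothetical:

```python
# Minimal sketch of a percentile band (hypothetical numbers): take the observed
# score plus/minus one standard error of measurement and convert both ends to
# percentiles of the norm group's (assumed normal) score distribution.
from statistics import NormalDist

norm_mean, norm_sd, sem = 100, 15, 4
observed = 108

norms = NormalDist(norm_mean, norm_sd)
low = norms.cdf(observed - sem) * 100    # percentile at the lower edge of the band
high = norms.cdf(observed + sem) * 100   # percentile at the upper edge of the band
print(f"percentile band: {low:.0f}th to {high:.0f}th percentile")
```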
ii. Henmon-Nelson Test (H-NT)
1. Of mental abilities
2. 2 sets of norms available: one based on raw score distributions by age, the other on raw score distributions by grade
3. Reliabilities in the .90s
4. Helps predict future academic success quickly
5. Does NOT consider multiple intelligences
iii. Cognitive Abilities Test (COGAT)
1. Good reliability
2. Provides three separate scores though: verbal, quantitative, and nonverbal
3. Item selection is superior to the H-NT in terms of selecting minority, culturally diverse, and
economically disadvantaged children
4. Can be adopted for use outside the US
5. No cultural bias; each of the subtests requires 32-34 minutes of actual working time, which the manual recommends be spread out over 2-3 days
6. Standard age scores averaged some 15 pts lower for African American students on the verbal and quantitative batteries
iv. Summary of K-12 Group Tests
1. All are sound, viable instruments
v. College Entrance Tests
1. SAT Reasoning Test, Cooperative School and College Ability Tests, and American College Test
vi. SAT Reasoning Test
1. Most widely used college entrance test
2. Used for 1000+ private and public institutions
3. Renorming of the SAT did not alter the standing of test takers relative to one another in terms of
percentile rank
4. New scoring (2400) is likely to reduce interpretation errors, as interpreters can no longer rely on comparisons with older versions; 45 mins longer – 3 hrs and 45 mins to administer
5. may disadvantage students with disabilities such as ADD
6. Verbal section now called “critical reading” – focus on reading comprehension
7. Math section eliminated much of the basic grammar school math questions
8. Weakness: poor predictive power regarding the grades of students who score in the middle ranges
9. Little doubt that the SAT predicts first-year college GPA
a. But, African Americans and Latinos tend to obtain lower scores on average
b. Women score lower on SAT but higher in GPA
vii. Cooperative School and College Ability Tests
1. Falling out of favour
2. Developed in 1955 and has not been updated since
3. Purports to measure school-learned abilities as well as an individual’s potential to undertake
additional schooling
4. Psychometric documentation not strong
5. Little empirical data support its major assumption – that previous success in acquiring school-learned
abilities can predict future success in acquiring such abilities
viii. American College Test
1. Updated in 2005, particularly useful for non-native speakers of English
2. Produces specific content scores and a composite
3. Makes use of the Iowa Test of Educational Development Scale
4. Compares with the SAT in terms of predicting college GPA alone or in conjunction with high-school
GPA
5. Internal consistency coefficients are not as strong in the ACT
ix. Graduate And Professional School Entrance Tests
1. Graduate Record Examination Aptitude Test
a. GRE purports to measure general scholastic ability
b. Most frequently used in conjunction with GPA, letters of rec, and other academic factors
c. General section with verbal and quantitative scores
d. Third section which evaluates analytical reasoning – now essay format
e. Contains an advanced section that measures achievement in at least 20 majors
f. New 130-170 scoring scale
g. Standard mean score of 500, and SD of 100
h. Normative sample is relatively small; psychometric adequacy (validity and reliability) is less than that of the SAT
i. Predictive validity not great
j. Overpredicts the achievement of younger students while under predicting performance of older
students
k. Many schools have developed their own norms and psychometric documentation and can use
the GRE to predict success in their programs
l. By looking at a GRE score in conjunction with GPA, graduate success can be predicted with
greater accuracy than without the GRE
m. Graduate schools also frequently complain that grades no longer predict scholastic ability well
because of grade inflation – the phenomenon of rising average college grades despite declines in
average SAT scores
i. Led to a corresponding restriction in the range of grades
ii. As the validity of grades and letters of rec becomes more questionable, reliance on test
scores increases
iii. Definite overall decline in verbal scores while quantitative and analytical scores are
gradually rising
2. Miller Analogies Test
a. Designed to measure scholastic aptitude for graduate studies
b. Strictly verbal; 60 minutes
c. knowledge of specific content and a wide vocab are very useful
d. most important factors appear to be the ability to see relationships and a knowledge of the
various ways analogies can be formed
e. psychometric adequacy is reasonable
f. does not predict research ability, creativity, and other factors important to grad school
3. The Law School Admission Test
a. LSAT problems require almost no specific knowledge
b. Extreme time pressure
c. Three types of problems: reading comprehension, logical reasoning (~half), and analytical
reasoning
d. Weight given to the LSAT score is openly published for each school approved by the American
Bar Association
e. Entrance into schools based on weighted sum of score and GPA
f. Psychometrically sound, reliability coefficients in the .90s
g. Predicts first-year GPA in law school
h. Content validity is exceptional
i. Bias for minority group members, as well as women
m. Nonverbal Group Ability Tests
i. Raven Progressive Matrices
1. RPM one of the best known and most popular nonverbal group tests
2. Suitable anytime one needs an estimate of an individual’s general intelligence
3. Groups or individuals, 5yrs-adults
4. Used throughout the modern world
5. Uses matrices – nonverbal; with or without a time limit
6. Research supports RPM as a measure of general intelligence, or Spearman’s g
7. Appears to minimize the effects of language and culture
8. Tends to cut in half the selection bias that occurs with the Binet or Wechsler
ii. Goodenough-Harris Drawing Test (G-HDT)
1. Nonverbal intelligence test, group or individual; quick, easy, and inexpensive
2. Subject instructed to draw a picture of a whole man and to do the best job possible
3. Details get points
4. One can determine mental ages by comparing scores with those of the normative sample
5. Raw scores can be converted to standard scores with a mean of 100 and SD of 15
6. Used extensively in test batteries
iii. The Culture Fair Intelligence Test
1. Designed to provide an estimate of intelligence relatively free of cultural and language influences
2. Paper-and-pencil procedure that covers three age groups
3. Two parallel forms are available
4. Acceptable measure of fluid intelligence
iv. Standardized Tests Used in the US Civil Service System
1. General Aptitude Test Battery (GATB) – reading ability test that purportedly measures aptitude for a
variety of occupations
a. Makes employment decisions in govt agencies
b. Attempts to measure wide range of aptitudes from general intelligence to manual dexterity
2. Controversial because it used within-group norming prior to the passage of the Civil Rights Act of 1991
3. Today, any kind of score adjustments through within-group norming in employment practices is
strictly forbidden by law
v. Standardized Tests in the US Military: The Armed Services Vocational Aptitude Battery
1. ASVAB administered to more than 1.3 million people a year
2. Designed for students in grades 11 and 12 and in postsecondary schools
3. Yields scores used in both education and military settings
4. Results can help identify students who potentially qualify for entry into the military and can
recommend assignment to various military occupational training programs
5. Great psychometric qualities
6. Reliability coefficients are excellent
7. Through computerized format, subjects can be tested adaptively, meaning that the questions given
each person can be based on his or her unique ability
8. This cuts testing time in half