R4techman 2021 v01
R4techman 2021 v01
∗
PhD, Mathematics, Arizona State University, 2007.
†
Professor Emeritus, University of Nebraska-Lincoln.
Behaviors Technical Manual Table of Contents
Table of Contents
List of Figures vi
1 Introduction 1
1.1 Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 How to Navigate this Manual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Interpretation of the Content of This Manual . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Analyses of the TTI Success Insights Style Insights Assessment for Diagnostic Purposes 5
ii
Behaviors Technical Manual Table of Contents
iii
Behaviors Technical Manual Table of Contents
References 157
iv
Behaviors Technical Manual Table of Contents
v
Behaviors Technical Manual List of Figures
List of Figures
2.1 Average, Adapted, & Natural Graph Comparison . . . . . . . . . . . . . . . . . . 20
2.2 Neurological Support for Two Graphs . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 S-Loreta Image of Gamma Activity in the Frontal Lobes . . . . . . . . . . . . . . 22
2.4 TTI Success Insights Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 Bad Correlation v Causation Example . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Information v. Target Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Forced-rank Dominance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Forced-rank Influence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Forced-rank Steadiness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Forced-rank Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 English US Style Insights Inter-item Correlation Plot . . . . . . . . . . . . . . . . 46
6.1 English US Style Insights Corrected Item-total Correlation Plot . . . . . . . . . . 50
10.1 Vectors in Rn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.2 Geometric Interpretation of Factor Analysis Parameters . . . . . . . . . . . . . . 88
11.1 Generic CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
11.2 Example of Item Difficulty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
11.3 Example of Item Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
11.4 Rasch 1-PL Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
11.5 Rasch 2-PL Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
11.6 Rasch 3-PL Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
11.7 Example Boundary Characteristic Curves . . . . . . . . . . . . . . . . . . . . . . . 103
11.8 Example Item Response Category Characteristic Curves . . . . . . . . . . . . . . 104
11.9 Example Item Information Function . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.10 Example Test Information Function . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11.11 Example Item Information Function with Reliability Boundary . . . . . . . . . . 107
11.12 Example Test Information Function with Reliability Boundary . . . . . . . . . . . 107
12.1 Generic Item Characteristic Surface for Thurstonian IRT Model . . . . . . . . . . 118
12.2 Generic Item Information Surface for Thurstonian IRT Model . . . . . . . . . . . 119
13.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.2 Example ROC Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
13.3 Natural Dominance Log Odds Plot . . . . . . . . . . . . . . . . . . . . . . . . . . 134
13.4 Intentional Log Odds Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
13.5 CEO ROC Curve: All Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
13.6 Sales1 ROC Curve: All Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
13.7 Accountant ROC Curve: All Variables . . . . . . . . . . . . . . . . . . . . . . . . 140
13.8 Manager ROC Curve: All Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 141
13.9 Sales2 ROC Curve: All Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
14.1 An Example of Asymmetry in EEG Captured Brain Activity . . . . . . . . . . . . 145
14.2 Confirmation Between the Survey and Brain Response
Respondent Answer: 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
14.3 Possible Socially Acceptable Response Based on Brain Activity
Respondent Answer: 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
14.4 Brain Response to Confusing Items
Respondent Answer: 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
14.5 Brain Response to Double Negative Items
Respondent Answer: 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
vi
Behaviors Technical Manual List of Figures
vii
Behaviors Technical Manual List of Tables
List of Tables
1.1 TTI Success Insights EQ
Technical Manual Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 TTI Success Insights Style Insights
Scale Abbreviation Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Common Reliability
Coefficient Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 TTI Success Insights
Information v. Trait Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1 Average English US Inter-item Correlations
TTI SI Style Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1 Corrected Item-total Correlation Levels Suggested by APA . . . . . . . . . . . . . 49
6.2 English US Corrected Item-total Correlations
TTI SI Style Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.1 Average English US α Coefficient Data
TTI SI Style Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2 Common Reliability
Coefficient Interpretation, see [59] . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.3 Average English US ω Coefficient Data
TTI SI Emotional Quotient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.4 English US α-if-Item-Deleted Data
Dominance, α = 0.889 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.5 English US α-if-Item-Deleted Data
Influence, α = 0.865 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.6 English US α-if-Item-Deleted Data
Steadiness, α = 0.855 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.7 English US α-if-Item-Deleted Data
Compliance, α = 0.849 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.1 Test-Retest Data Breakdown
TTI Success Insights Style Insights . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.2 Test-Retest Correlations 01/01/2015 to 03/25/2020
TTI Success Insights Style Insights . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.1 Hypothetical Example of 5 Individuals and a Single Facet . . . . . . . . . . . . . 66
9.2 Schematic View of Teacher Lesson Plan Example . . . . . . . . . . . . . . . . . . 67
9.3 Style Insights Scale X Test-Retest as a Two-Facet G-Theory Model . . . . . . . . 69
9.4 Hypothetical Example of n Individuals and One Facet (“Time”) . . . . . . . . . . 70
9.5 Style Insights Dominance Scale Test-Retest
One-Facet G-Theory Model 1/1/2015 to 3/25/2020 . . . . . . . . . . . . . . . . . 70
9.6 Style Insights Influence Scale Test-Retest
One-Facet G-Theory Model 1/1/2015 to 3/25/2020 . . . . . . . . . . . . . . . . . 71
9.7 Style Insights Steadiness Scale Test-Retest
One Facet G-Theory Model 1/1/2015 to 3/25/2020 . . . . . . . . . . . . . . . . . 71
9.8 Style Insights Compliance Scale Test-Retest
One-Facet G-Theory Model 1/1/2015 to 3/25/2020 . . . . . . . . . . . . . . . . . 72
9.9 Style Insights One-Facet Test-Retest Generalizability and Dependability
1/1/2015 to 3/25/2020 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
10.1 Examples of Observed Variables from [100] . . . . . . . . . . . . . . . . . . . . . . 81
viii
Behaviors Technical Manual List of Tables
ix
Behaviors Technical Manual List of Tables
xi
Behaviors Technical Manual List of Tables
xiii
Behaviors Technical Manual List of Tables
xiv
Behaviors Technical Manual List of Tables
xv
Behaviors Technical Manual List of Tables
xviii
Behaviors Technical Manual Introduction
1 Introduction
The following manual contains information on the history and development of the concept of behav-
iors from its beginnings to current implementation of these concepts as psychometric assessments.
Included in this manual are a host of mathematical, statistical, and psychometric analyses used to
establish evidence of validity and reliability of the TTI Success Insights Style Insights assessment.
The introductory section contains two major parts. The first is a brief summary of some of the
highlights of the company known today as TTI Success Insights. The second is a synopsis of how
to navigate through this manual. A final short section summarizes how this manual is intended to
be interpreted.
1
Behaviors Technical Manual Introduction
IIC represents the average inter item correlation, see Section 5.1 for a detailed explanation. ITC
represents the average item to total score correlation, see Section 6 for a detailed explanation. α
(the Greek letter alpha) represents the α coefficient usually attributed to Cronbach, see [56]. ω (the
Greek letter omega) represents McDonald’s ω, see [178] for a treatment of several possible internal
consistency estimates. Both concepts are presented in Section 7.
T/R refers to the section on test-retest or temporal consistency, a second part of the presentation on
reliability in this manual. The third portion of reliability presented in this manual is Generalizability
Theory, denoted G-Theory in Table 1.1. EFA is short for exploratory factor analysis which is covered
in Section 10. GRM is the abbreviation used for graded response models, one form of item response
theory particularly suited to analyzing Likert-style response data. Our treatment in Section 11 is
closely aligned with the original work of Samejima in [146].
2
Behaviors Technical Manual Introduction
Thurtsonian IRT is a generalization of the usual item response theory modeling approach to a
multi-variable item response theory model. An important byproduct of this modeling approach is
that it allows for the application of the Thurstonian Law of Comparative Judgment to be applied
to forced-rank data, allowing such data to be transformed into a format useable by a multi-variable
item response theory model. Our approach, presented in Section 12, closely follows that of Brown,
et. al., see, for example, [32, 122], among many others.
Three sections round out the remainder of the analytical portion of the manual. Section 13 presents
our analysis of the relationships to other variables as defined by the American Psychological Society,
see [156]. The TTI Success Insights approach to measuring response process validity evidence
through gamma waves generated by electroencephalograph (EEG) analysis is presented in Section
14. This section closely follows the article [26] which presents the protocols for a similar study
conducted using the TTI Success Insights EQ R assessment. Finally, Section 15 concludes the
analytic sections of this manual with an outline of our approach to consequences of testing evidence
of validity.
Not mentioned to this point are Sections 2 and 3. Section 2 presents an historical synopsis of the
concept of behavior, along with the history and development of the TTI Success Insights Style
Insights assessment. Section 3 presents our view of the current state of literature as related to the
concepts of reliability and validity, as they relate to contemporary assessment development and
evaluation. Finishing the manual is Section 16, which both summarizes the current body of work
and puts in place the plans for future work for the next iteration of this technical manual.
A series of appendices presents information that is important to the overall discussion, yet not
relevant to the current discussion. Some of this information includes historical analyses conducted
over time by TTI Success Insights, see Appendices B through G. There are also appendices our
approach to assessment adaptation, see Appendix A, and a presentation of our assessment devel-
opment process, see Appendix H.
As a final note, the following abbreviations are used throughout the document to refer to the scales
or constructs of the TTI Success Insights Style Insights assessment.
Scale Abbrev.
Dominance D
Influence I
Steadiness S
Compliance C
The results of item analysis are presented in Section 4. These results show that the vast majority
of the items fall in an acceptable range of both discrimination and difficulty based on the American
Psychological Associations definition of the use of corrected item-total correlation as a proxy for
item discrimination and a measure of item difficulty defined by an item endorsement approach, see
the introduction to Section 4.
The results of the inter-item correlation analysis are presented in Section 5.2. Based on the analysis
presented in this section, the inter-item correlations for all scales fall within the acceptable range as
stated in Section 5.1. Similar analyses, presented in Section 6.2, show that the corrected item-total
correlation for all items falls within acceptable ranges.
Measures of consistency are presented in Sections 7 through 9. In all cases, the scales show strong in-
ternal and temporal consistency, along with strong Generalizability and Dependability coefficients.
Further, Section 7.4 presents the so-called “alpha-if-item-deleted” analysis. Generally speaking,
the analysis presents solid results with regards to this measure of consistency.
Sections 10 through 12 are laying out roadmaps to the future for the TTI Success Insights continual
improvement process. Some factor analytic results are included, although the item response theory
results are not. The limited inclusion of results for these areas of interest is largely based on the
timing of the 2020 COVID-19 pandemic. Prior to March, 2020, the TTI Success Insights research
program began an extensive study of tools in our suite of assessments. Under normal circumstances,
the velocity with which we are able to collect data facilitates a rather quick turnaround for research
purposes.
With the onset of the pandemic and various business and community shutdowns across the U.S.
and globally, our assessment collection velocity slowed to a trickle. Forced with a decision to wait
out a lengthy delay in data collection or publish our current analyses, it is decided to publish our
current analysis along with the roadmap to the future. This roadmap is discussed at length in
Section 16.11.
The main body of this manual, presented in Sections 2 through 12, is intended to present what
TTI Success Insights has done or is doing to visit the concepts of evidence of content validity
and evidence of internal structure validity, the first two areas of evidence of validity noted by the
American Psychological Association in [74]. The remaining three types of evidence of validity are
presented in Sections 13, 14, and 15.
Section 13 presents the results to date of the measure of the Style Insights variables against external
variables. The main body of external variables at our disposal are our job classification data,
obtained via our demographics collection, and analyzed in this section. The interested reader is
referred to both Section 13 and Section 3.3 in order to place the entire conversation into context.
Section 3.3 presents an argument why the so-called validity coefficient is, at best, a weak indicator
of any causal relationships that may exist between any two given variables. For further context, it
is often the case that a correlation coefficient is used as a proxy for a validity coefficient.
Section 14 is a particularly interesting section as it discusses the evidence of validity based on
response processing. The American Psychological Association notes the difficulty this particular
area of evidence of validity presents. However, as we note both in Section 14 and in [26], the TTI
Success Insights research team is working at the forefront of the combination of the neurological and
psychometric, employing cutting edge techniques in an attempt to capture the response processing
of the brain while responding to assessment items.
Section 15 presents an often overlooked and undervalued measure of evidence of validity. Conse-
4
Behaviors Technical Manual Introduction
quences of testing validity evidence is intended to show all possible consequences of the use and
interpretation of assessment scores. This includes the good and the bad, with a focus on establishing
a priori protocols to mitigate those possible negative consequences before they arise in practice.
As a final note, the questions of reliability and validity are not easily answered. They take a body of
evidence compiled with solid information gathered over many years, in all areas of interest. In the
following, we present a comprehensive look at the TTI Success Insights Style Insights assessment,
while maintaining a strong focus on the future, not just the current. Our Style Insights assessment
is solid and performs well in many areas. Like all assessments, continual monitoring of performance
is necessary to ensure contemporary interpretations remain as valid today as they were 30 years ago
when this assessment first came to life. The TTI Success Insights continual improvement process
is designed to continually monitor, analyze the information obtained, and modify or replace items
as necessary to maintain first class assessments.
1.4 Analyses of the TTI Success Insights Style Insights Assessment for Diag-
nostic Purposes
This section presents a preliminary discussion of the TTI Success Insights continual improvement
process. As is noted earlier in this work, the Style Insights assessment is administered in a forced-
choice format. Classical Test Theory is the most widely applied set of analytic tools for assessments,
based on the ease of implementation and interpretation. These tools are applied to the forced-rank
choice output earlier in this section.
It is noted, however, that Classical Test Theory tools are not directly applicable to forced-choice
data. The analysis presented earlier is based on separation of the forced-choice from the ordered
response string into separate scale based response data. Strictly speaking, the data, once removed
from the response string, is now free from the zero determinant covariance matrix problem. This is
interpreted as Classical Test Theory tools now may be applied to the data, as long as the analyst
understands that the results are not completely accurate.
With this in mind, TTI Success Insights intends to implement testing strategies to gather data based
on a six-category Likert-style response format in order to more completely analyze the performance
of the individual items, free from the forced-choice influence of the remaining items in a given
frame. The data gathering is somewhat limited, given that, in an attempt to limit test fatigue, a
choice is made to gather information on no more than eight additional items appended to the end
of the regular 24 frame forced-choice assessment.
It is noted that this is an imperfect approach to a challenging problem. However, at this time, TTI
Success Insights is taking a diagnostic approach to analyzing our Style Insights assessment. In that
spirit, we believe that incredibly valuable information is being provided by this approach. Further,
this approach is planned out over the next several months, perhaps years, to be iterative and lead
to a complete analysis of the Style Insights assessment.
The intent is to apply this iterative process to identify high performing items for retention, mid-
range performing items for modification, and any low performing items for replacement. Once this
process has identified the “best” set of 24, or possibly fewer, items for each scale, other processes
may be employed to determine the “best” approach to formulating any new forced-choice frames
as appropriate. These approaches may include such ideas as comparison of discrimination and
difficulty parameters based on a graded response model approach, see Section 11 for a discussion,
and Thurstonian item response theory models may be utilized for analysis of the multivariable item
response theory models, see Section 12 for an introduction.
5
Behaviors Technical Manual Introduction
It is also worth noting that mathematical and statistical analysis alone cannot accomplish the end
goal of this ongoing exercise. Analysis of the individual items in the face of the construct definitions
and content domain must also be part of the consideration.
6
Behaviors Technical Manual History and Theory
This connection to psychology is not readily apparent until we discuss briefly his core philoso-
phy. Pythagoras believed that everything within the universe could be broken down into a simple
numerical form of proportions. This is key on two fronts: primarily, this is the initial basis for
applying mathematics to anything. In a secondary fashion, when Pythagoras applied the question
to humans themselves, he achieved a most unique answer. He determined that the human body
was a separate entity from the mind (soul), essentially that thinking and acting were two parallel
functions of our existence.
In a short period of time (within his own lifetime) other philosophers would take Pythagoras’
theory of Dualism and study it in new and, at the time, highly exciting ways. First among these
was Alcmaeon of Croton. He was known as a medicine man and considered health in perspective
with this dualist philosophy. His ponderences led him to postulate that health itself consists of a
state of balance between certain opposites: hot, cold, wet, dry, internal and external. Also, that
not only was health due to a balance between these things, but disease itself was caused by a
dominance of one of these items. From modern medicine, we see that this was as uninformed as it
is ancient, however, for the time, it was an incredible breakthrough. It is interesting to note that
even today the field of psychology continues to argue the existence of the mind as being separate
from the brain. This dualism issue is not settled.
Moving forward, we see this theme of four structural components can also be traced to the second
century Greek physician, Galen (130 A.D. - 200 A.D.), who was one of the first to describe human
temperaments when he created his four-factor personality typology as sanguine, choleric, melan-
cholic, and phlegmatic. Others classified observable human behavioral traits into far more than
four groupings.
Then in 1921, Carl G. Jung (1875-1961) published Psychological Types [96]. The book was orig-
inally published in German and not until 1971 was it translated into English. This new view of
typology was an attempt t o r econcile t he opposing t heories of Sigmund Freud and Alfred Adler. He
concluded t hat Freud's t heory was based on extroverted views and Adler's on i ntroverted j udgments.
Jung's book i ntroduced t he i dea t hat each person has one of eight dominant psychological t ypes.
Jung proposed f our main f unctions of consciousness: t wo were defined as perceiving or non-rational
(sensation and i ntuition), and t he other t wo as j udging or r ational f unctions ( thinking and f eeling).
These f unctions are each modified by t wo overlying attitude t ypes: extroversion and introversion. The
resulting proposal s uggest t hat t he dominant f unction, along with t he dominate attitude
characterizes consciousness, while t he opposite i s r epressed and characterizes unconscious
characteristics.
Based on t hese definitions, t he r esulting t ypes are:
While contemporary behavioral psychology finds flaws in all of these early theories, it is important
to recognize these early philosophical efforts are worth revisiting in light of modern neurological un-
derstandings. The result is absolutely fascinating. For example, Jung's insight into the unconscious
role of feelings and intuition is now understood as a decision-making precognitive neuro-pathway
8
Behaviors Technical Manual History and Theory
and rational/thinking typology aligns with self-aware cognitive processing. These foundational ele-
ments continue to evolve as our understanding of neurobiology advances. More of these neurological
connections to human behaviors and decision-making pathways can be explored in the following
references and when appropriate, will be expanded upon in context [19], [20], [21], [22], [26], [27],
[49], [48], [50], and [52]
As is true of many of these early behavioral theorists, no actual assessment accompanied their
philosophic theories. For example, Jungian psychology remained for the most part remained an
academic endeavor until two American women, Isabel Briggs Myers and her mother Katharine
Cook Briggs, set out to find an easier way for people to employ Jung’s idea that everyone has
a typology. The resulting assessment is known as the Myers-Briggs Type Indicator (MBTI) and
positions people in one of 16 different groupings.
It is important to note that most of the research supporting MBTI has been produced internally by
the Center for Applications of Psychological Types which has raised concerns of bias and conflict of
interest. Independent sources, including Robert Hogan in his 2015 book “Personality and the Fate
of Organizations” called the assessment “little more than fortune cookie” [87]. Other descriptors
in the literature call MBTI “Pretty much meaningless” [63], “one of the worst personality tests in
existence” [3], and “the fad that won't die” [78]. Even MBTI?s webpage points out that the
assessment should not be used for hiring. It is interesting to note that another popular assessment,
the Big Five personality traits, shows some correlation to MBTI in four of their five scales [11].
William Moulton Marston (May 9, 1893 - May 2, 1947) can be credited with providing much of
the original theoretical philosophy behind the DISC system of personality classification. Raised in
Massachusetts, Marston received his education from Harvard, earning a BA in 1915, an LLB law
degree in 1918, and in 1921, a PhD in psychology. Over the next ten years, Marston worked in
academia, teaching at American University and Tufts University. During this time, and for years
after, Marston continued his own work as a psychologist, developing his DISC Theory. Marston
wrote essays in popular psychology and, in 1928, published “Emotions of Normal People” that
explains the DISC Theory [120]. In this book, he describes human behaviors along two axes
with actions tending to be active or passive depending on the whether the individual viewed the
environment as antagonistic or favorable. By placing these axes at right angles, four quadrants are
formed with each describing a behavioral style: Dominance (D), Inducement (I), Submission (S)
and Compliance (C).
While the theory can be traced to Marston, it is important to notice he once again did not develop
an assessment. That initial task occurred some 20 years after the publication of “Emotions of
Normal People” by a former student of Marston, Walter V. Clarke. The rich story of how this
occurred creates a fascinating journey involving a number of key contributors, including; Prescott
Lecky, William Marston, Walter Clarke, John Cleaver, John Geier and then continues with present
day tools thanks to TTI Success Insights.
Like any good story, it is best told by those most closely connected. In this case we were able to
capture an interview with one of the pioneer users of the TTI Success Insights DISC tools, Peter
Turner. Peter is currently the TTI Success Insights Master Distributor for the United Kingdom
and, as such, was able to provide first-hand insights into the developmental path of DISC. To
maintain the continuity of this interview, we will publish the conversation as it was captured and
then return to key events to expand on details where necessary.
9
Behaviors Technical Manual History and Theory
Clarke did his Master degree as a student of Marston in 1930. But he didn't appre-ciate
some of the philosophical underpinnings of what Marston taught. For example,
dominance is obviously a word that has sexual connotations and anyone can be influen-
tial. This takes on even more significance when you overlay the fact that Marston lived
with two wives. Whilst Clarke believed that people should be able to live according to
their behavioral style, he did not appreciate this particular life style. So, while Marston
provides our foundation for the TTI Success Insights behavioral assessment, Clarke in
particular refined and added greatly to the concept and to our understanding of what
it was that we actually measure.
Clarke didn’t really value a lot of what Marston said. Indeed, he didn't just measure the
four vectors we know he measured at least 2 others. So, Marston was seen by Clarke as
very, very clever but not using the tools in ways that he thought they should be used. But
the starting point for all of this, including Marston, was really tied to a guy called
Prescott Lecky. I originally added him on Wikipedia. Others have taken stuff off and
added other pieces about him since my original contribution, but at least he is there to
be remembered. https://en.wikipedia.org/wiki/Prescott_Lecky#cite_note-4
I couldn’t find any reference for him, maybe, 10 years ago, perhaps when Wikipedia first
came out. But now people recognize him for what he was, a truly inspirational thinker
who started both Marston and Clarke by laying a philosophical foundation. He was a
psychology lecturer at Columbia University from 1924 to 1934, at a time when American
psychology was dominated by behaviorism. He developed the self-consistency theory as
a method of psychotherapy. Self-consistency theory remains relevant to contemporary
personality and clinical psychologists. Michael Stevens in 1992 captured the essence of
his contributions in an article titled: Prescott Lecky: Pioneer in consistency theory and
cognitive therapy [160]. As the story goes, Lecky asked a group of students on the first
day of class, “What is the most important thing to human beings?”
And majority of the students were saying things like laws, survival, food, all those
kinds of things that you think of as being the basic things in life. And he said, yeah,
okay. I understand what you’re saying, but I’ve got a problem with that. So why
would three-quarters of a million American servicemen volunteer to fight in the first
world war? When they signed up in 1917, there were no secrets anymore. They all
knew what they were signing up to do. It was over the top and many were mown
down by machine gun. And then between time, it wasn’t very pleasant either. And
so why would they volunteer? If food, warmth, air, survival, all those things were the
most important things to human being, why would anybody consider doing that? His
students couldn’t think of anything, so he said, I’ve got a theory. And my theory is, that
the most important thing to human beings is not self-preservation, but preservation of
the self-image.
And in other words, if you believe that you should be there, if you believe that it’s right,
or you believe that it’s a brave thing, or you believe that everything, whatever bought
10
Behaviors Technical Manual History and Theory
your beliefs off, you will seek out the opportunity to deliver that belief. Subconsciously
it’s not a conscious thought, it’s subconscious.
This concept actually explained a lot of things to me personally, because when Art
Niemann told me this story, I finally understood, why I volunteered for impossible
tasks. I remember when I was an operations guy at Halfords, the worst district in the
country was under performing, it was in the Southwest and miles away from anywhere
and when the job was offered, I felt my hand going up in the air, I couldn?t help it! I
was drawn to be able to fix that problem. And this is what happens when people believe
they are good at something. It explains why some people do things and others don?t.
Now, obviously Clarke, who was sitting there in that class, thought to himself, if that’s
true, all I’ve got to do, is put some words in front of people and they will instantly be
drawn to certain words. They will tick a word to say that’s like them or not like them.
As a result of that experience, this became the fabric of Clarke’s research that went on
between 1928 to 1948. He started his own company, Activity Vector Analysis (AVA)
which was part of Walter V. Clarke and Associates with the purpose of bringing his
products to market. It is even more interesting to note that at the time Clarke was
majoring in industrial engineering and it appears that he was in on this Lecky lecture
by chance. It was also by chance he was sitting next door to Marston and these events
created a calling for Clarke. But as a result of that key interaction, he changed his
attention from industrial engineering to pursue a psychology master?s degree. The
title of his master?s thesis was “Physical Types as Bases for Variations in Primary
Emotions”. So, that’s kind of the basic start of all of this.
The only reason I know these stories, is because I was trained by Art Niemann and Art
worked for W. Clement Stone of Combined Insurance where he was trained by Walter
V. Clarke himself. The training was composed of a two-week course with Clarke, which
Art said, “Changed my Life”. Combined Insurance ended up purchasing Walter V.
Clarke and Associates.
I remember Art telling me that it was the most powerful time of his life because all of
a sudden, his life changed, as a result of understanding behavior. It made a massive
impact upon him.
So that’s the background of how this came about. Walter Clarke used the Marston
model, but didn’t like the words so he changed the words. For example, dominance first
became aggressive. We know that being aggressive toward a task is okay, but also can
mean being aggressive toward people, so he changed it to assertiveness. In addition to
changing construct words, he created a scoring that measured both highs and lows as a
continuum.
He next changed the term influence to sociability because he believed it as more de-
scriptive. He pointed out that anybody can influence others, so it’s a misnomer to think
otherwise. But social people are drawn to conversations all the time, so he used the
word sociability to describe what we call the “I” factor.
The “S” factor, steadiness, still creates a lot of confusion. For some it implies that a
person is sloth like. If you’re calm and relaxed and move about slowly, people think
that’s an “S” trait. Well it might be, but moving slowly has more to do with your
11
Behaviors Technical Manual History and Theory
energy level. The “S” trait has to do with how you deal with the rate of change; the
rate of change and the propensity to be calm.
Clark thought of the “S” factor as being like the waves in an ocean. When the wind
blows you can be quite tormented, massive waves and people get angry. But then they
calm back down like the sea that’s a tranquil person’s natural state. It’s not tranquil
all the time, but if nothing is happening, no winds are blowing, then they are very
calm. They won?t go looking to make the waves and the way he described it was a bit
like a swan on the surface of the water looking all calm and relaxed, but underneath
their feet are paddling like mad. So, don’t think that the person has nothing happening
inside. Clarke says they keep it locked up. Therefore, the observable behavior of the
“S” appears to be tranquil. Like the ocean without much wind.
The “C” factor was called avoidance. His first thoughts were avoiding, but again, this
connotes that people avoid situations. That isn’t strictly true. They avoid confronta-
tion. So, he settled on the concept of conformance or compliance. Conscientious was
also discussed, but everybody can be conscientious.
This process of finding just the right descriptive words was the result of a massive
research project. Clarke was commissioned by the U.S. Air Force to interview soldiers
returning from the war in 1945 and 1946. The goal was to help find jobs that would suit
them. But rather than just examining their hard skills, Clarke determined and analyzed
their behavioral styles. This experience provided Clarke with a deep understanding of
both different styles and allowed him to refine his descriptive word choices from the
tens of thousands of actual assessments plus hundreds of taped interviews and follow
up conversations.
When he launched his business in 1948 he had solid information to work with and
statistics to back up his ideas. I should also mention that these early assessments used
a list of 81 verbs from which you checked off if they described you. Within those 81-
verb list, Clarke had created not four, but five factors. So, we had the assertiveness,
sociability, tranquility and conformance, but he added another one as well and that was
a maturity factor.
For example, you get a salesperson who based on the first four factors, appears to be
right for this job, but he says the wrong thing at the wrong time. This is an example
of the maturity factor. If you had a low maturity score, then you were more likely
to make these errors. A low maturity factor basically meant that you couldn’t tell
them anything. They were un-coachable. They used to say that the IRA terrorist
were extremely mature, which at first seems to be the paradox. But what he meant
was that they always felt that they were right. They knew in their mind they were
absolutely right. That was the fifth factor. It might be something very similar to our
(TTI Success Insights) present Emotional Quotient assessment that Clarke noticed all
those years ago.
There was another factor that Clarke identified and that was honesty. He developed a
mechanism by which to measure honesty, but ultimately decided it was not beneficial.
This is another example of how Clarke differed from Marston. Marston believed that
first and foremost, you have to tell the truth. In fact, that is the origin of the “Lasso
12
Behaviors Technical Manual History and Theory
of Truth” in Wonder Women. It is interesting how all these things fit together. The
honesty tool was never released and Art said that he never saw it.
Next, let me tell you about the AVA business. Art worked for W. Clement Stone, who
founded the Combined Insurance of America (CIA). I think it is called AXON now.
And I’ve heard lots of stories about Clement Stone, how he ran his business. I’ll just
give you an example of what he was like. He started the company off with, he said, $10.
Well, actually the company, CIA was up for sale in the early 1920s. And he noticed it
was up for sale. At the time, Clement was writing insurance policies. So, he went to the
owner, I said, “I know somebody who could buy this company”. And the owner said,
“Oh, that’s very good”. Clement said, “Would you sell it, if I got somebody who could
buy it?” And he said, “Oh yeah, straight away”. So, Clement asked him how much he
wanted for it? I don?t recall the exact amount, but it was over a million, a big sum of
money even for those days. So, he said, “Right, who is the buyer?” And Clement said,
“I am!” And the owner said “You can’t buy it. Who is your backer?” And Clement
said, “You’ll back me! If you don’t get any other offers, I think I might be your best
opportunity.”
Why? Because Clement was his best salesman. He sold lots and lots of insurance. So,
on that basis, he bought the Combined Insurance of American for $10 on the basis that
he would pay back the loan out of profits in the future. So that’s how that whole thing
was purchased.
Backing up a bit, Art had his degree in something like human resources and on the first
day of work with CIA, as director of HR, he checked in with the reception desk and
was told to go back to the locker room and get changed. He explained that he was the
new HR director, but the receptionist insisted that his job was to take the mail around
to offices. So, he delivered all the mail to people for about three or four weeks. One
day he bumped into Mr. Stone and asked “When am I going to start working at my
director of HR job?” Mr. Stone says, “You are already! You’ve met everybody, you’ve
talked with everybody when you take them coffee. I want you to continue to do that.”
He says, “And this is your job description.” He turned piece of paper over on his desk
and said, “Make my people happy”. That was his job description, “make my people
happy”. So, he continued to go around fixing the lights interacting with people.
What he learned from all that was that it isn’t the big things that upset people in
work. It’s the little things. And by understanding how people behaved and what their
expectations were, he was able to make people happy. All this tied back to Clement
Stone. As a result of his background with AVA, he actually commissioned them to do
some research for him later on. And I found a paper today, which shows that Clarke
was doing this kind of analysis for insurance sales people.
There is a white paper written on that somewhere, which you can find. And so that’s
the result of that. He got Art to go and be trained by AVA. So that’s how all that
kind of ties in with Clement Stone. Now, Clement Stone in his books, wrote in there
saying that the only way that his business grew from being a $10 company to a multi
multimillion-dollar company was by the use of behavior profiles; making people happy
by them doing the job which suits their natural behavior. So, he credited the whole of
his growth to being able to use these tools to first of all, identify what the job was and
13
Behaviors Technical Manual History and Theory
then match behavioral styles to that job. They used a tool called JAR (Job Activity
Report) to define the job, which is the same as our work environment assessment.
It is important to note that the overall ethics of the AVA company, the Walter V. Clarke
company, was that they were there to do good. That was the philosophy they built the
company around. This was a very, very powerful vision. You don’t pigeonhole people,
you don’t have highs and lows. You don’t have good and bad styles. You just create
this continuum of styles that allows you to understand somebody’s preferred style of
behavior. Each of these shapes had a name. So, you didn?t really pigeonhole people but
you did place them within this continuum so you could understand where somebody
was and use this to match them to appropriate jobs.
J. P. Cleaver, “Clipper Cleaver” as he was called, joined Clarke in 1951 and left in 1956
to start his own company. Now according to Art Neiman, Cleaver took all of Clarke’s
research, when he left. He took a cabinet of papers with all his research. You would
think that that would cause major upset, however, within two-or-three years Cleaver
came back to Clarke and asked him to help him to devise a new instrument. So, Clarke
worked with Cleaver and came up with a closed questionnaire instead of the list of
descriptive words Clarke had created. These are the kinds of questions that we see now
and in the format that we use.
The result of this collaboration was a second alternative approach. In August of 1963,
Walter V. Clarke published a paper titled: Forced choice versus free response in per-
sonality assessments. That’s quite an interesting paper, especially when you consider
that Clarke continued to use the open question format. And I think they do to this day.
J.P. Cleaver, “Clipper Cleaver”, went on and used the closed response instrument. So,
Clarke was involved in both assessments. Cleaver also turned the graph so they were
vertical and removed the maturity factor, and went back to using the Marston’s DISC
nomenclature that we see today.
So, Clarke had his hand in these new assessments, but he didn’t agree with the use
of highs and lows and, therefore, stayed with his original open-question approach. I
should add that there never were any patents filed on anything or copyrights because it
basically belonged to Clarke and he didn’t object to their use. Clarke actually helped
JP Cleaver grow his business because it was for the good of humanity, as he saw it. So,
a very strong ethics drove this early work.
Going on from that, Art Neiman started his own company in 1980. And about 12 months
after that Mr. Stone’s Combined Insurance of America bought the AVA company. AVA
subsequently has been sold again to somebody else. I spoke to the new owner once, but
can’t remember who they are right now.
Going back to Clarke’s open question assessment, the number of verbs that you ticked
actually created what he called an energy line. So, in other words, if you ticked a lot
of verbs, you were said to have bags of energy. If you only ticked a few, you didn?t
have much energy. As far as I know, there is absolutely no research to support this
conclusion. I’ve never seen anything to say that that is a valid way of measuring if
someone has any energy or not. Whatever energy is any way. I don’t know, it might be
more accurate to measure if you had a pint of beer or two the night before.
14
Behaviors Technical Manual History and Theory
In preparation for this interview, I found more information, white papers and other
publications. One in particular is a good description of early personality measurements.
This 1965 paper was published in the Journal of Clinical Psychology by Clarke, and a
guy called Peter Merenda [127]. Now I know about Peter Merenda because Art told
me about him. He worked with Clarke for a while and earlier to today I found out he
only died about 12 months ago. He would have been able to tell us lots of things about
all this. I did learn that in 1957 he got his Ph.D. from the University of Wisconsin in
Madison and he was assigned to the Naval Examining Board.
He partnered with Walter Clarke and was a professor of psychology at the University of
Rhode Island where he co-founded the Department of Psychology and Computer Science
during his tenure in academia. He was one of the original members of the American
Psychological Association and was granted the prestigious Samuel J. Massey award.
Over the years, Dr. Merenda developed close relationships with scholars all across the
globe. He and Clarke wrote loads of papers with him. For example, another one I’ve
got here is about the construction of an industrial selection personality test. That’s
riveting reading by the way. That was published with Walter Clarke in 1956 [40]. This
connection with academia adds a lot of credibility to Clarke’s work.
2.1.2 Expanding on Peter’s Interview and Highlighting the Role of Prescott Lecky
Until Peter Turner mentioned a significant contribution of Mr. Lecky to the history of DISC we, for
the most part, had never mentioned him. Upon further investigation, an exciting set of foundational
components were discovered that in many ways lay a much more solid theoretical foundation of
DISC than Marston provides.
We learned that Lecky pioneered the concept of self-help as a method of psychotherapy. This new
theory was built on the belief that the most important motivator for humans was the preservation
of the self-image. He understood that one’s self concept dictated their resulting behavior. His trans-
forming concepts directly influenced Maxwell Maltz’s classic self-help book, “Psycho-Cybernetics”
as well as George Kelly’s book, “The Psychology of Personal Constructs”. In addition, Lecky
inspired several notable students, including Hans Ansbacher, Walter Clarke, Gardner Murphy,
Frederick Thorne, and Carl Rogers [4]. During the winter of 1927-1928, Lecky studied with Adler
in Vienna, visited Jung, and, possibly, met Freud during that same trip [167].
In the 1990s, Michael J. Stevens traced the foundational elements of numerous cognitive-phenomenological
theories back to Lecky’s by preforming a massive search of the literature as well as numerous inter-
views. His findings were first presented at meetings of the American Psychological Association in
1990 and again in 1991. These findings were then published in the Journal of Clinical Psychology
in 1992 [160]. A quote from that paper starts to set the stage for understanding just how influential
Prescott Lecky has been and how his work continues to drive our theoretical understandings.
dissonance, Heider’s balance model, Murphy’s perceptual learning process, and Osgood
and Tannenbaum’s congruity principle. (pg. 808)
So why are we not more aware of this man’s turning point influences? The reasons may lie at the
heart of his own DISC profile. Imagine a man who has an off scale theoretical motivator and a
major “D-C” (potential me-me conflict). Now overlay this profile with the fact that his position as
psychology lecturer at Columbia University was contingent upon the completion of his dissertation
and as such he was given a ten-year window to complete this task.
Lecky’s former student, John Francis Adams Tayler sums up such a scenario when he states in a
biography about his mentor:
Although he had accumulated sufficient data for an acceptable thesis at several points
in his career, he could never bring himself to completing the final details necessary
for the accomplishment. Part of this inefficiency may be attributed to perfectionistic
tendencies which made him unwilling to publish anything which did not reflect his latest
views. He revised his dissertation again and again, but was never completely satisfied
with the results.
At the time of his death, Lecky had only produced two professional paper presentations [107] and
[108], and a book review [106]. However, he left a collection of incomplete manuscripts and a
handful of former students who were determined to eventually make known his name and his Self-
Consistency theory of personality work. As a result of this group determination, in 1945 a 154-page
book was posthumously published under his name in compliance with government regulations on
conservation of essential materials [109]. In 1951 the original manuscript was expanded to 275 pages
with interpretations added by Frederick C. Thorne [110]. Thorne was Lecky’s graduate assistant
from 1938-1939 and a critical admirer of Lecky as a scholar, teacher, and clinician. This 2nd edition
was reprinted again in 1961 [111].
16
Behaviors Technical Manual History and Theory
17
Behaviors Technical Manual History and Theory
persons classified as President, Executive, Salesman, Accountant, Engineer, Teacher, and Machine
Operator. Clarke continues his rationale for comparing these job groupings, by stating, “If this
instrument is a useful tool for selecting salesmen, executives, accountants, or the like, there must
be some significant differences among the patterns obtained from these people,” pp. 384. The
study was able to show that “fairly significant differences” were found between most of the jobs
and where differences were not found, similarities between the requirements of the jobs appeared
to justify the lack of differentiation.
It is interesting to note that from this initial reporting of the assessment construction, Clarke
generated a series of articles that each described the unique qualities identified by his assessment
tool. These follow up articles expand on how these tools can profile: life insurance agents [41],
loan office managers [42], self-made company presidents [43], prediction of occupational hierarchy
[44], AVA as a selector of life insurance salesmen [171]. Upon release of this documentation and
protocols, a host of other predictive publications followed by other authors [88] and [98].
18
Behaviors Technical Manual History and Theory
missioned a systematic series of independent external reviews of reliability and validity on the Style
Insights instrument. The first report was performed by Professor Russell Watson from Wheaton
College in 1986 and repeated again in 1989. Additional studies followed in 1992 and were published
in “The Universal Language DISC: A Reference Manual” in 1993 [173]. That manual was reprinted
and updated again by Warburton and Suiter in 1999. This revised edition included an updated
study by Associate Professor James R. Hall, Chair of the Department of Psychiatry & Human
Behavior at the University of North Texas, Denton, Texas [174].
With the transition of scoring the DISC assessment from Most/Least to using all four item rank-
ings in 2011, a new study was commissioned. Once a large enough sample size was collected using
the new scoring protocols, the next external review was expanded in 2015 when Professor Delwyn
Harnish, Director of Assessment at the University of Nebraska-Lincoln, ran reliability studies on
not just US English, but preformed complete reliability studies on each of the following assess-
ment translations including: Brazilian-Portuguese, Chinese-Simplified, Dutch, English-Australian,
English-Canada, English-South Africa, English-UK, French, German, Hungarian, Italian, Polish,
Portuguese, Russian, Spanish-Americas, Spanish-Spain, Swedish, and Turkish [85]. A similar in-
ternal reliability review was once again performed in 2017, [166] thus maintaining this ongoing
performance appraisal. With Style Insights (a TTI Success Insights DISC assessment report) now
published in over 40 languages, TTI Success Insights will continue to analyze each of these trans-
lations, once a large enough sample size can be collected.
Take a moment and think about debriefing the information provided from the far-left graph, without
either of the other graphs. No doubt the debrief would focus on the dominant nature of the “D”
factor, the rather low “C” and the “I” score that is well above the midpoint. Now add to the story
by including information from the middle (adapted) graph. One immediately sees that this person
20
Behaviors Technical Manual History and Theory
is adapting their “I” factor down rather dramatically. The final piece of the puzzle only emerges
when the natural style is exposed in the far-right column. The natural scores are generated by
using what a person says they ?are not?. The significance of this negation will be expanded upon
in a moment.
While the original research references to support this decision has been lost with the passing of Bill
Bonnstetter, anyone who has debriefed numerous reports can recall scenarios when critical issues
only emerge when comparing the natural and adapted, such as those in Figure 2.1.
One of the clearest historical references in support of this dual world view is found in an article by
Dr. David Warburton, [172] titled: The importance of finding the real person at work. Please note
that he does not say the importance of finding the “right” person, but the importance of finding the
“real” person. Warburton’s paper lays the groundwork for these natural and adapted behavioral
difference when he states:
All of us have developed behavioral styles, which are represented as distinct cognitions,
emotions and actions. The central core of our behavioral style tends to remain stable
because it reflects our individual identities. However, the demands of the work setting
often require different responses that evolve into a work behavioral style. In order to
help people understand themselves, we need to evaluate this disparity. (pp. 2)
Warburton goes on to explain how this separation of perspectives requires an understanding of how
people think about themselves. He grounds his comments in the work of Lamiell [102], [103], and
[95], who points out that human judgements are not framed by contrasting themselves with others,
but in fact frame their self-identity by negation [37]. This literature states that our identity emerges
through a process of articulating what we are not, not what we are. To make this decision-making
process even more fascinating, is that much of this mental processing draws upon implicit mental
processes for which we are not even cognitively aware of at the time. As a result, self-assessment
ratings that expose what a person is not, provides a more direct path to the person’s true view of
him/herself [172].
Warburton and Suiter [174], further exposed the value of this distinction between Graph II (Core
Behavioral Style) and Graph I (Responses to Environment) in a study of 150 middle managers.
At this point, the TTI Success Insights DISC assessment required the participants to simply select
the one item that was most like them and the one item that was the least like them, with the least
being assigned to their natural or core style and the most representing their adapted or response
to environment. (Notice how once again what one is NOT becomes the natural or core style.) The
study found that differences in the two DISC graphs were a predictor of problems at work and at
home. Furthermore, the distinction between acceptance and rejection was found to relate to the
amount of alcohol use, physical health, mental health, job satisfaction and absenteeism. The paper
goes on to point out the immense physical and financial costs of job dissatisfaction that can be
exposed when examining these discrepancies.
While these findings speak volumes for the need for two different graphs, research always seeks
verification from different approaches to the same issue. The verification came about as a result
of a unique assessment improvement protocol that TTI Success Insights was the first to employ.
In 2011, Ron Bonnstetter and Dustin Hebets began collecting real time brain imaging data while
participants took the TTI Success Insights DISC assessment. The basic process in depicted in Figure
2.2. The key to understanding this line of research that tied to the fact that electroencephalogram
(EEG) assessment data showed totally different brain activity between accepting a descriptive word
compared to rejection of descriptive word or phrases.
21
Behaviors Technical Manual History and Theory
Figure 2.3 illustrates three basic different participant responses to a DISC forced-choice assessment
cell. Notice how the intense red gamma activity is dominate on the left side when responding to
a positive reaction and to the right side when rejecting a concept. (Point of clarification, these
images are showing the brain as it would be looking at you, so the left brain is on the right side.)
The images used in Figure 2.3 showing acceptance and rejection sample images were included in
our approved United States patent application. In other words, the patent office has affirmed our
ability to differentiate these two ends of the choice spectrum by issuing, “Validation Process for
Ipsative Assessments”, US Patent No. 9,060,702 [17]. This original patent was reaffirmed when a
similar patent was file and approved by Canada [18].
Figure 2.4 shows how this process has been used to both validate assessment items and to gain ad-
ditional knowledge regarding participant decision-making. This individual selected ?Bold, Daring?
as the most descriptive word or phrase and “Careful, Calculating” as their least descriptive terms.
In other words, the person “accepted” item 1 and “rejected” item 4. Notice how the red area of
the frontal lobes moves from a dominate left side of the brain for their 1st choice and clearly to the
right side for their 4th choice. Clearly there is a difference in how the brain processes this data and
TTI Success Insights has collected an abundance of data to support this difference.
In summary, it is important to note that if responses to the acceptance rankings, (1 and 2) are
22
Behaviors Technical Manual History and Theory
very similar to the DISC scores derived from the rejection rankings, (3 and 4), then the two graphs
would also be very similar. A case can be made that having two graphs that are similar is a
strong indicator that the persons core behavior and their adapted behavioral styles are in harmony.
Those DISC providers who only offer one averaged graph would find affirmation in this scenario.
However, when the two graphs are different, an explanation needs to be explored and both the
historical findings and the neurological data clearly make a case for this additional detailing.
Simply put, DISC behaviors cannot accurately be described using one graph when in fact each of
us make decisions that are based on an initial, precognitive, neurological asymmetry that draws
upon totally different brain processing of acceptance and rejection. Forced-choice data in a single
graph is analogous to comparing apples and oranges as one thing.
The human brain starts a decision by grouping input along a continuum that is documentable in
the frontal lobe asymmetry. In addition, and this is a crucial statement, humans process rejection
input faster and with greater intensity than acceptance! This trait has been instrumental in our
survival as a species. It also means that what we say we are NOT is a stronger reaction and a
clearer indicator than what we say we ARE. Therefore, scoring a person?s 3rd and 4th choices
(least descriptive) actually provided an inverse indicator of who we really are, thus our natural
style comes from items 3 and 4. Our adapted style then is secondarily processed and provided by
scoring items 1 and 2 as an indication of our adapted behavior.
It is always rewarding when observed findings by early researchers, such as Warburton, Suiter,
and Bill Bonnstetter, are later explained by more recent neurophenomenological findings. I could
provide much more about the role of the emotional load of the item and other details, but the
references below can provide those exciting details that I am sure you will want to embrace.
Since the filing and approval of these original patents, TTI Success Insights has successfully pub-
lished over a dozen peer reviewed papers that use this neurophenomenological gamma asymmetry
to document acceptance versus rejection [17], [18], [19], [20], [21], [22], [26], [27], [48], [49], [50], and
[52].
There’s no question about it. I’ve never once in all the thousands of reports I have
personally fed back to people, never once have I had reason to doubt what was reported.
I suppose that’s a kind of self-reporting face validity. And I have only had two people
tell me the report was wrong. And one of those was a person who said, it’s wrong that
you should follow me around. And to find out how I live my life. He really thought
that I had written the report myself by following him around.
The history of DISC is rich and the future is even more exciting as we explore new ways to improve
on these foundational efforts through continued research and development.
23
Behaviors Technical Manual Reliability and Validity
Selecting criteria that are consistent with the theory underlying the construct is easier
said than done. Valid external criteria are hard to obtain, expensive, or both. In
other cases, valid external criteria may simply not exist. Often, collecting criterion
data is impractical, and even when such data are gathered, they may be of questionable
reliability or corrupted by biases (e.g., supervisors’ ratings used in an employment
setting or teachers’ ratings in an educational setting).
This quote is a bit disconcerting. In particular, TTI Success Insights has experienced a great deal of
demand in the secondary education space in recent years. The main indicator of performance used
in that space are students’ grades, which are directly assigned by the teachers, and we are being told
that such data are likely corrupted. We then pose the following question. Is it possible to validate
the use of any assessment in an educational setting? This is not meant to sound alarmist. Rather,
it is intended to spark debate on the issue. The TTI Success Insights suite of assessments has been
used with a great deal of success in this arena, yet we are faced with a seemingly unsurmountable
obstacle. We will come back to this part of the discussion later.
The remainder of this manuscript presents some of the obstacles we see coming out of the academic
world in the various areas of reliability and validity. We view the relationship with the academic
community as a very important one. The academicians develop the theories that assessment devel-
opers and users apply in their everyday work. We seek to strengthen that relationship rather than
create a divide.
another time. The concept of reliability of an assessment has been around since at least 1937 when
Kuder and Richardson introduced the so-called α coefficient for dichotomous items, see [101]. It is
not the intent of the previous statement to give a complete treatment of the history of reliability,
only to establish that the concept is not a new one.
The main issue we wish to discuss here is the lack of a consensus in the world of assessments as to
what is the best approach to measure reliability and at what levels may we consider an assessment
reliable, or rather to have an acceptable amount of evidence of reliability. There have been many
published articles in peer reviewed journals discussing the differences of opinions on this topic. Two
of the more recent ones are [38] and [86]. [38] is a bit more useful in that it discusses several different
approaches and offers some suggestions as to directions one could take. [86] is a nice exposition on
an empirical analysis of the use, or misuse and misunderstanding, of the α coefficient specifically.
[38] offers opinions based on six common misperceptions surrounding the α coefficient and goes on
to suggest ways of using structural equation model based techniques (SEM), such as variants of
McDonald’s ω, to compute reliability estimates. Their six misperceptions are:
2. α equals reliability;
4. Reliability will always be improved by deleting items using “alpha if item deleted” analysis;
In a similar fashion, [86] presents six alleged misunderstandings and then conducts an empirical
analysis of published authors to determine whether the alleged misunderstandings are accurate.
We are not here to pass judgment on the correctness of the arguments nor the adequacy of the
empirical study mentioned above. The reader interested in the arguments or the final analysis may
consult the original works in [38] and [86], respectively. We do, however, want to point to a small
handful of comments made by the authors of both works that are particularly revealing about the
current state of affairs when it comes to internal consistency measurement.
The first issue we wish to bring into the discussion is the idea that Cronbach is the originator of
the so-called α coefficient. Though [38] and [86] present slightly differing views, they generally
agree that [101] first proposed a “reliability formula (called KR-20) that can be used for data on
dichotomously scored items...”, see [38]. [86] states that “Alpha was introduced by Kuder and
25
Behaviors Technical Manual Reliability and Validity
Richardson (1937) for dichotomously scored items”. We point out this difference given that [38]
does not refer the result of [101] as “alpha” while [86] does. This is not a major issue in our opinion,
but does point to an inconsistency in definitions in the literature that may need to be addressed.
Are the formulas for KR-20 and alpha the same, or at least can they be considered close enough
so that this inconsistency is not worth discussing?
It is our understanding that the two measures are equivalent if they are applied to dichotomous
data. However, this does not imply they are equivalent. In fact, KR-20 should not be applied when
data are not dichotomous. In our opinion, this difference is important and should be highlighted,
especially in light of the fact that the publication of origination of this discussion is one highlighting
inconsistencies in the understanding and use of the α coefficient.
[38] traces the history of reliability coefficients to Spearman and Brown, both independently pub-
lishing in the same issue of the British Journal of Psychology, see [158] and [34]. The formula
published has since become known as the Spearman-Brown formula. [86] begins their discussion
with [101]. We would like to point out that our opinion is that both articles are informative and
that we are using this likely minor inconsistency to further and extend the debate that [38] is
introducing.
Range Interpretation
α ≥ 0.90 Consider Shortening
0.80 ≤ α < 0.90 Very Good
0.70 ≤ α < 0.80 Respectable
0.65 ≤ α < 0.70 Undesirable
α < 0.65 Unacceptable
The debate is centered on what is the best measure of internal consistency. The debate is clearly
not settled given that [38] was published in 2015 and [86] was published in 2019. Our attempt to
further the debate is an attempt to get clarification on what is the best approach to establishing
acceptable levels of internal consistency. Our extension is to point out the lack of guidance in all
areas of evidence of reliability and validity.
As another example of issues we see, consider the discussion on acceptable levels of a measure of
internal consistency. For example, the Table 3.1, above, is presented in [59] and reproduced here
for purposes of the current discussion. Additionally we present the following quote from [10]:
How high should values of α (or related indices such as KR-20) be? I hesitate to give
an answer to this question because estimates of α can be affected by many things, such
as the purpose of the test, the heterogeneity of the sample from which it was obtained,
the conditions of testing, and the number of items.
Further we have [131] stating that 0.70 is acceptable during development, 0.80 for “basic research
scales”, and 0.90 for clinical settings are appropriate minimums. [142] recommends a minimum
of 0.80 without the same restrictions as [131]. As a final example, [97] suggests values of 0.95 for
clinical settings. [10] concludes her remarks on recommended levels of reliability with:
26
Behaviors Technical Manual Reliability and Validity
The higher values for clinical decisions reflect the serious consequences of such decisions
and the need for correspondingly higher standards. The same argument for higher
reliabilities could be made in the context of educational tests that are used for student
placement, graduation, or remediation decisions or for licensure or certification tests.
Indeed, any test that will be used to make consequential decisions about students,
clients, patients, or others, should contain as little error as possible.
We agree with the idea that our measurements should contain as little error as possible. We agree
that it is in the best interest of all involved (the developer, the user, the respondent, etc.) for any
assessment to have as little error as possible. The issue we wish to discuss is whether it is possible
to pose a well-defined response to the concept of as little error as possible while, at the same time,
not bringing our industry to a standstill waiting for a solution.
We have attempted to show the lack of agreement in to major areas of measuring internal con-
sistency. First, there is no agreement across the academic community as to the best measure of
internal consistency. Second, there is no agreement on what levels constitute acceptable levels in
the process of establishing evidence of reliability. If this is the case, what is the motivation for the
practitioner or user of such instruments to extend their own knowledge in an attempt to establish
such evidence of internal consistency? If those largely responsible for establishing the standards
are either unable or unwilling to state, definitively, what those standards are, how can the same
groups stand so definitively for what they are not?
review process focused on helping to improve an institution’s mathematical and statistical modeling
approaches. However, in the experience of one of these authors, it often led to confrontational
relationships between the model validation teams and the model building teams and/or model
owners. This was especially true when the model validation team came from an outside source
such as an external consultant or government auditor.
This led one of these authors to suggest to the company he worked for a change in use of language
from a model validation project to a peer review project. That simple change in language made
huge differences in the relationships and proved highly productive during that time. While it is
understood that such a change in language in this space is a really big ask, perhaps it is worth the
discussion.
We do not believe that the intent of the words reliable and valid was to have a black and white
cutoff point, nor do we believe it was intended to be part of answering the question “Is your
assessment reliable and valid?” We choose to follow the APA guidance as presented in [74] and use
the language evidence of reliability and validity. However, as a company we receive inquiries on an
almost daily basis asking if our assessments are reliable and valid.
We do note that [156] differentiates between validity and validation, noting that “the former refers
to the degree to which an assessment fulfills its intended purpose” while “validation refers to the
process of gathering and reporting evidence to evaluate the use of a test for a particular purpose”.
While this distinction is noted here, we doubt the average user, potential client, or test taker has
the time to take to understand the difference.
[5] states that there are five sources of evidence of validity. However, we note that these are not
defined as the sources of validity, see pp. 13:
The following sections outline various sources of evidence that might be used in evalu-
ating the validity of a proposed interpretation of test scores for a particular use.
The italics in the previous quote are added by the authors of this manuscript for emphasis. We
make note of two important implications of the previous quote. The first is the use of the phrase
“might be used” implying that there are potentially other sources of validity. The second is that
the use of the phrase “evaluating the validity of a proposed interpretation of test scores...” This
second is a statement about the nature of the relationship between validity and an assessment. It
is not the assessment itself that is valid, rather it is the interpretation and use of the scores that is
of importance.
The five sources are listed as follows:
Each of these areas is intended to measure a specific aspect of the use and interpretation of as-
sessment scores. An example provided in [5] for test content evidence is as follows, see pp. 14-15.
They state that if we are measuring mathematical ability at a certain level, it is fair to test on all
28
Behaviors Technical Manual Reliability and Validity
mathematical content up to and including that level whether the concepts have been introduced to
the respondents or not. In a different scenario, suppose we are interested in testing the knowledge
base of that same group of students based on delivered curriculum. In this case, the content domain
should be limited to what the students have been exposed to in the delivered curriculum.
We are not dealing with mathematics curriculum. We are dealing with, for example, Style Insights R
and four constructs. We are interested in the content of the test (the questions, items, frames, etc.)
and the construct(s) it is intended to measure, see [5], pp. 14. We therefore need to have a
serious conversation about what the construct definitions are (for all our constructs on all our
assessments) and then make a determination as to whether the content domain of definition is
adequately covered without introducing irrelevant information. These two concepts are known as
construct representation and construct irrelevance.
Conceptually speaking, evidence of validity based on test content is relatively easy to understand.
In practice, it may be quite difficult to establish. As an example with Style Insights, the history
going back to Marston has been well documented by many. However, there do exist some issues
that prevent a full content validity evidence claim based on this lineage. The arguments laid out
in Marston’s work have been debunked by both the psychology and neurology communities. The
understanding of how the brain facilitated emotions in the early part of the 20th century has been
shown to be incorrect by modern technological advances such as the electroencephalogram (EEG).
Further, most of the development of the current iteration of the DISC assessment occurred in the
late 1960s to early 1970s by Geier who never published his findings. These two facts place a difficult
obstacle in the path of any content validity argument. This is not to say that one cannot create an
argument for the content validity of the DISC assessment. It means that we need to find or develop
the theory of a four-factor model that is not based directly on the work of Marston and Geier.
Some may challenge our assertions here. Simply put, the highest form of validation comes from
an independent third party review of our assessments, their use, interpretations, and so on. This
third party reviewer is likely to be some form of academician, and we must approach a project
of this type with the utmost transparency if we expect to be treated fairly. We need to own the
negative and establish processes to correct that which is correctable or find different approaches
and justification for that which is not directly correctable.
Evidence based on response processing is a challenging topic. The APA takes the following position
on this portion of validity evidence, see pp 76 of [156].
Gathering validity evidence based on response processes is perhaps the most difficult
validity evidence to gather because it involves demonstrating that examinees are in-
voking the hypothesized constructs the test is designed to measure in responding to
test items. ... Gathering evidence is difficult because one cannot directly observe the
cognitive processes going on within people’s heads as they respond to test items.
Or, can we? In some sense, there seems to be a pattern in which we are asked to gather certain
kinds of evidence of validity followed by commentary on how difficult, if not impossible, gathering
such evidence is. For example, see the introduction to this chapter and its discussion about criterion
data. We do agree that gathering this type of evidence is challenging, and it is doubtful that many,
if any, assessment developers (academic or otherwise) are even attempting to do so.
However, TTI Success Insights has had an active neurological research group since 2011, and one of
the areas of recent focus has precisely been studying respondent brain reactions to our assessment
items. The current study has focused on the TTI Success Insights Emotional Quotient assessment
29
Behaviors Technical Manual Reliability and Validity
with an anticipated peer reviewed journal submission sometime in late 2020 or early 2021. We have
already had the protocol paper accepted and published, see [26].
The APA offers the following suggestions in light of their statement on the difficulty of gathering
evidence based on response processing.
2. Chronometric analysis;
Several of the listed approaches are non-starters for TTI Success Insights. For example, think
aloud protocols may be an interesting approach, but we could do no more of that type of evidence
gathering than we could with gathering brain images during item responses. Perhaps an interesting
study could come of this, but it would be quite limited.
At the current time, TTI Success Insights does not have the ability to record time to respond to
individual items on our assessments. We do record total time, but that is not what chronometric
analysis requires. However, TTI Success Insights is currently working on many projects to update
our database and computerized assessment approaches. One such project will bring the ability to
record individual item response times. When that time comes we will be in a position to perform
such studies.
For evidence-centered design, this approach is, by definition, part of the design process of an
assessment development. If we ever completely design a new assessment from the very beginning,
we should incorporate this type process. Otherwise, it is not of use to us at this time. Similarly,
evaluating processes used by graders is meaningless in our setting as we do not grade anything.
We are then left with the Other category and mathematical modeling of item difficulty. Our
brain research falls into the Other category and is similar to some suggestions given in [156] such
as monitoring eye movements during task performance. As for mathematical modeling of item
difficulty, TTI Success Insights has incorporated GRM approaches from IRT in our Likert style
response assessments (e.g., Emotional Quotient). We have not specifically viewed these models
from the assessment development standpoint and “treated item attributes as facets”, see [156] pp.
77-78. We are using these models for two purposes. First is to generate a weighted scoring model
for each scale measured by the assessment and second is to gain diagnostic information on the
assessment scales.
The extension of these type models to our forced-rank assessments is more challenging. Our plan
is twofold. We are first gathering data on the DISC and Motivators assessment items in Likert
style formats. We plan to use graded response models or similar to model item difficulty and
discrimination and use that information to determine optimal, in some sense, pairing of D, I, S,
and C items. See the section on graded response models applied to Likert response data from our
DISC assessment for a more complete explanation, [25]. Once these pairings have been established,
we will build forced-rank frames and gather response data on these new frames. The second IRT
30
Behaviors Technical Manual Reliability and Validity
approach will take the (possibly) updated frames and follow a Thurstonian IRT model, or similar,
to analyze the data.
A more complete presentation of the Thurstonian IRT model, including references, is provided
in the section on Thurstonian IRT models applied to forced-rank assessments in [25]. Briefly,
one cannot directly apply IRT models to forced-rank assessment data. Therefore, we transform the
data using the Thurstone’s Law of Comparative Judgment. This new data may be used to estimate
the parameters of a multidimensional IRT model. At this time, our approach has been to use a
multidimensional probit model. A probit model is another name for a model based on a normal or
Gaussian distribution.
Validity evidence based on internal structure appears to be the area of validity for which the most
tools are available and also appears to be the the area most readily implemented. Tools such
as exploratory and confirmatory factor analysis are available and described beginning on pp 73
of [156]. Also discussed are IRT models, multidimensional scaling, and evaluating the invariance
of the assessment structure through the use of differential item functioning. This last area is an
important one in the US for legal reasons. Differential item functioning and similar tools are
intended to measure the invariance of the response patterns of different groups of individuals. For
example, the concept of disparate impact in hiring processes may be partially addressed by showing
that protected classes, such as gender, age, race, etc., are not negatively impacted by assessment
scores.
That is not to say that there aren’t differences in different groups of individuals. Differential item
functioning analysis is intended to identify where the differences occur internal to the assessment
items. It is then up to the assessment developer to provide guidance on how to compensate for
such differences, if they exist, during the use of the assessment scores.
All the aforementioned approaches are implemented for the various TTI Success Insights assess-
ments as they apply. As an example, like IRT models, factor analytic models are not directly
applicable to forced-rank data. The reason for this is technical, and we briefly mention it here.
Factor analysis is also known as analysis of covariance. These processes essentially compare a the-
oretical covariance matrix, with certain assumed structure, with the population covariance matrix
of the assessment data. During the estimation process, it is necessary to invert the population
covariance matrix. Unfortunately, it can be shown that any forced-rank assessment generates a
singular covariance matrix. In simple English, one cannot invert a singular matrix. It is essentially
dividing by zero.
As mentioned previously, we plan to implement the Thrustonian IRT model to help in the analysis
of both the response processing and the internal structure evidence of validity. We do note that, for
diagnostic purposes, we have employed exploratory factor analytic approaches on the Style Insights
scales individually. While this approach is not directly applicable to establishing internal structure
validity, it has been invaluable in identifying those items that perform well and those that do not.
What it cannot do is give us a reason for why certain items underperform given that they are also
influenced by the existence of three other items in the forced-rank frame. See Section 3.5 for a
discussion of the pros and cons of both forced-rank and Likert style scales.
Before going on to the next area of evidence of validity, it is important to note that while factor
analysis is a very commonly employed tool in assessment development and there is a host of pub-
lished articles and books on the topic, there is little guidance on what is a “good” factor analysis.
So we factor analyze our assessment data. Now what? The only guidance we have found on this
topic is to look at a couple basic things. We would like to have simple structure in the factor pattern
31
Behaviors Technical Manual Reliability and Validity
matrix, and we should look at the total variance explained by scale. What is simple structure of a
pattern matrix? What is an acceptable level of total variance explained? The following is guidance
from [5], Standard 1.13, beginning on pp. 26.
We can infer from the literature essentially what we desire in the form of a factor pattern matrix.
However, what we cannot infer is what is an acceptable level of factor loading or what is an
acceptable level of variance explained. To date, these authors have found no definitive statements
on either of these topics. It would seem reasonable that 50% of the variance could be used as a
minimum. It appears, however, that no one is willing to definitively state that this is a reasonable
minimum level of variance explained to aim for. We emphasize the word minimum to clearly
indicate that should be a minimum requirement and not a final goal.
We do note that part of the issue in making definitive statements on a topic such as acceptable
levels of factor loading is that it varies with the number of items associated with the construct
being measured. The variance explained computation based on a small number of assessment
items cannot absorb a weakly loaded item well while the same computation on a large number
of items can. As mentioned in the section on reliability above, more definitive guidance from the
academic community on topics such as this would be appreciated.
We briefly discussed the evidence of validity based on relations to other variables in the introduction
to this section. In Section 3.3, we discuss correlation versus causation and take a more in-depth
look at some of the issues we see with this approach. We briefly summarize a key point. The social
sciences community appears to have a much lower standard for what are considered acceptable
levels of correlation between variables than, say, the physics or mathematics communities. This
seems to cause a problem. Not only are the discussions in the social sciences using correlation as
a proxy for causation, which is dangerous in its own right, but these discussions also accept quite
low levels correlation as explanatory.
Finally we discuss evidence based on consequences of testing. This evidence is based on an evalua-
tion of the intended and unintended consequences associated with a testing program. We provide
the following quote from [156] as evidence of the lack of consensus in the community for this type
of evidence.
We believe this debate to be one of nomenclature, and given that virtually all testing
programs have consequences on some level, it is important to evaluate the degree to
which the positive outcomes of the test outweigh any negative consequences.
Our point here is that while the APA’s viewpoint is that evidence based on consequences of testing
is an important measure to be considered in the overall validity argument, the APA does not review
assessments in a manner similar to the BPS. Given that everyone is not in agreement, how much
time and effort does one expend on an area in which there does not appear to be a consensus?
32
Behaviors Technical Manual Reliability and Validity
This is a clearly ridiculous example, yet correlation, and not necessarily causation, between variables
is one of the most common tools used to establish evidence of validity based on relationships to other
variables. It seems odd to accept correlation as evidence of a causal relationship while claiming that
correlation cannot establish the desired causation. We noted in the introduction of this section the
APA’s comments on criterion related data. In this section we discuss the apparent contradiction
that is the use of quite low levels of correlation to help establish a causal relationship between
assessment scores and external variables.
The authors of this paper are relative newcomers to the world of social sciences. We have each
spent most of our professional lives working in the so-called hard sciences. As noted earlier in
this manuscript, one of us spent many years working in mathematical and statistical modeling and
model validation in the banking and finance sectors. One general rule of thumb or guideline is that,
in practice, correlation less than |0.40| between variables essentially means there is no relationship
between those variables.
Our viewpoint is that when attempting to build a predictive model, we would only keep explanatory
variables that were reasonably well correlated with the response variable and were correlated at
less than |0.40| with the other potential explanatory variables. The thought here is that correlation
really only begins to become meaningful when greater than |0.40|. Another way to think of this is
if we square the correlation coefficient, we get an approximation of the linear variation accounted
for between the two variables. For a correlation at the level of 0.40, only 16% of the linear variation
between the two variables would be accounted for.
It is not the intent of this section nor is it the intent of TTI Success Insights to point to specific
competitors or individuals on this topic. However, a review of several competitors’ claims of
evidence of validity based on relationships to other variables shows that in many cases, claims
of the existence of such relationships is established based on correlation coefficients anywhere from
0.10 up to 0.50. The majority of those reviewed by the authors of this work are well under 0.30.
Putting this into the context of explaining the linear variation between the variables, we see that
claims are made for the existence of (causal) relationships between variables based on 1% up to
33
Behaviors Technical Manual Reliability and Validity
9% of said linear variation being explained. It is the hope of TTI Success Insights to help establish
that explaining less than 10% of the variation between variables establishes a fairly weak argument
for correlation and really does not establish anything in the area of causality.
We do understand that traditionally lower levels of correlation have been accepted in this area, and
we make no claim that all work should be thrown away because of this acceptance. We do argue
that as the future unfolds, we look to raise the standards of what is considered acceptable levels of
correlation to help establish causal arguments. We find it odd that the acceptable level of variance
explained for a scale on an assessment is at a minimum of 50% (at least by some suggestions), while
in the same field less than 10% may be thought of as providing an acceptable level of information.
This is clearly not a complete apples to apples comparison. However, the interpretation should be
clear from the context.
In the spirit of not pointing out a problem and then offering no solution, we offer at least a simple
proposal for a possible better approach to establishing relationships between variables. Rather than
using correlation coefficients, we can use other modeling techniques that may be scored in ways not
directly related to correlation. For example, one of the approaches used at TTI Success Insights
is to establish relationships between set inclusion based on an external variable and the odds of
existing in certain scoring levels on our assessment scales. We do this using a logit model, although
many other possibilities exist. We then measure the success or failure of the models based on a
contingency table style analysis including the use of the receiver operating characteristic (ROC)
curve. We note this is not the only approach, just one that has proven successful for TTI Success
Insights.
Essentially we establish that our variables provide a better than random chance of identifying
randomly selected groups of individuals from a desired target population out of a larger population.
This analysis would be more enlightening were it to be combined with some form of performance
or longevity metrics, at a minimum. In those cases where we have performance data, we have seen
improved performance of our modeling approach in identifying not only the individuals, but also
accurately ranking these individuals based on their performance.
This is not a perfect solution to the aforementioned problem. However, it does extend the analysis
and moves away from using a tool that is not a good fit for the conclusions being drawn.
34
Behaviors Technical Manual Reliability and Validity
2. Career assessment
5. Readiness certification
Each of these assessment applications bring with it diverse consequences resulting in different ethical
and legal concerns. Therefore, it is vital that the assessment users understand both the strengths
and limitations of self-report assessments.
The key to solving any problem is to first clearly identify the concerns. [12], [118], and [135] have all
described the psychometric properties of existing self-report measures that need to be acknowledged
and, if possible, addressed during any continual improvement process, as well as when interpreting
and presenting results. While identification of concerns is crucial, it is important for the reader to
understand that TTI Success Insights is not only aware of such concerns, but is also continually
striving to rectify these concerns whenever possible. A complete discussion of how each potential
bias is being addressed by TTI Success Insights is beyond the scope of this manuscript, a follow on
manuscript addressing each of the following areas is in progress.
As outlined in [12], several particular biases that can influence self-report measures include:
1. Social desirability bias - the tendency to respond in ways that, rather than truthful, are
culturally or socially appropriate, acceptable, or desirable.
2. Reference bias - the tendency to evaluate oneself in comparison to one’s own peer group
rather than to a broader or more objective set of standards. This most likely occurs due to
an individual’s lack of knowledge about groups beyond their own peers.
3. Acquiescence bias - the tendency of survey respondents to answer all questions on a survey
in the affirmative.
4. Consistency motif bias - stems from the tendency of survey respondents to want to maintain
what they consider to be consistency in their pattern of responses. Respondents may apply
their own theory of how test items are interpreted and adjust their responses accordingly.
Faking may also be a problem in self-report methods, especially when the test taker knows that
the stakes are high. When using self-report methods, test users may mitigate biases and faking
through appropriate strategies of test constructing or by triangulating sources of evidence.
A review of [6], [9], [65], and [176] provide additional concerns for which we are aware and prepared
to rectify whenever possible. While most of the following are incorporated to some degree in the
core challenges listed above, it is useful to examine any issue from multiple perspectives. The
following is a partial list of some of the alternatives described in the aforementioned literature.
1. Introspective ability - describes the lack of introspective ability to respond completely accu-
rately to assessment questions. This bias is a concern when examining constructs such as
self-awareness. It may not be appropriate to assume that everyone has an adequate level of
awareness to be able to accurately respond.
35
Behaviors Technical Manual Reliability and Validity
2. Honest/Image management - self-reports rely on the honesty of the respondent and the re-
search must trust the responses to be honestly provided.
4. Rating scales - the use of rating scales implicitly assumes that all participants interpret the
scale in the same manner. One major issue is that different respondents may rank items
differently based on differing interpretation of the magnitude of the point scales.
5. Response bias - the amount of evidence required to affirm or disagree with a statement may
vary depending on respondent experiences or any other of a number of variables.
6. Ordinal measures - almost all self-report measures produce ordinal data. Ordinal data only
tells the researcher the ordering of the ranking, not the distance between. The concept of
ordinal data may be contrasted with interval data which does provide distance information
(but still suffers from similar interpretation issues as that mentioned under rating scales
above).
7. Control of sample - Assessments that are taken online are especially susceptible to the afore-
mentioned biases. There is virtually no control over the environment of the respondent, time
frame, state of mind of the respondent, or any number of other adverse conditions.
In the same argument, it is at least implied that the Likert-style format directly allows for inter-
personal comparisons. We partially agree, but also point out that the Likert format assumes that
the distance between every choice is automatically the same. In other words, the Likert format
assumes that the choice to rank an item 4 or 5 or 6 on a scale (from 1 to 6, say) are all weighted
the same. Stated in a more concrete way, the choice to go between Slightly Agree (4) to Agree (5)
is weighted the same as the willingness to go between Agree (5) and Strongly Agree (6). In the
language of the previous section, Likert-style scales also generate ordinal data.
To propose an even starker contrast, suppose we consider that the same format also weights the
choice to move from Strongly Disagree (1) to Disagree (2) the same as the willingness to go from
Agree (5) to Strongly Agree (6). The authors of this manual have not seen a strong enough
argument that this is, in fact, the case. All questions are not weighted equally across their response
possibilities, hence the reason for the existence of such IRT models as the Graded Response Model.
In other words, we agree that the Likert-style format is relevant and useful if it is used appropriately
and has an adequate scoring model generated with a relevant IRT model or a similar approach that
allows for an analytical weighting scheme to be applied.
One argument against the use of Likert-style response formats is that respondents can be all things
at all times. This is, of course, true. It appears to fall under the headings of Social Desirability and
Acquiescence as noted in the previous section. The so-called “faking” process is also a potential
pitfall to be concerned with. We do note that this highlights one of the advantages of the forced-
rank format. That is to say that the forced-rank approach does not allow one to be all things.
It is also the case that, unless the respondent is extremely familiar with the assessment and its
constructs or scales, it is more difficult for the respondent to simply fake it to obtain a particular
score profile for a given assessment.
Further, for both the Likert and forced-rank formats, modern technology presents the challenge
of the “how do I game the assessment” websites that have popped up over the years. This is
particularly problematic if the assessment(s) in question are part of a hiring process or if the
results are used for training or advancement at a particular company. These prospects place a
great deal of pressure and responsibility on the assessment developer to ensure the development
process constructs assessments and/or scoring procedures designed to identify and deal with such
behavior. As noted, this is an issue for both response formats.
One longstanding critique of the forced-choice format is that the usual classical test theory (CTT)
analytic techniques do not apply to the data generated by such assessments. Further, the extension
of classical test theory to item response theory is also not directly applicable to the output of
forced-rank assessment response data. It is impossible for the authors of this work to disagree with
these comments. In particular, there are mathematical issues with the application of either CTT
or IRT to forced-rank assessment data.
The arguments for why this is the case are beyond the scope of this work. However, there are
two main points. The first is related to both IRT and factor analysis from CTT. In both cases,
analysis requires the use and inversion of the covariance matrix underlying the data, and this is
not possible in the case of forced-rank data. Any full covariance matrix based on forced-rank data
is not invertible, by definition. The second argument is that even in the cases where one can
mathematically apply CTT or IRT approaches, the interpretation of the results is suspect. The
example to keep in mind is any forced-rank assessment in which we have n scales. If we take the
data generated by any n − 1 scales, we may have a perfectly valid and invertible covariance matrix.
The data in those is still impacted by removed scale and any interpretation based on the n − 1
37
Behaviors Technical Manual Reliability and Validity
Only by deleting a variable or variables can strict dependencies be removed but it will
be appreciated that the variables left still have shared specific variance, etc., and so
problems of interpretation remain.
The authors in [94] are generally quite negative towards forced-rank data, but mostly due to the
the use of the data generated to justify the reliability and validity of those assessments. In the
abstract to [94] the authors at least state the following:
This is not to say that ipsative tests have no utility but that the claims made for
their validity and reliability and their applicability to inter-individual comparisons are
misleading.
The authors of this work would like to take this moment to note that this is one of the main themes
we are trying to address. The academic community has, over the years, been very vocal about
their critiques of assessments without offering solutions to the problems posed. It is not logical to
argue with the claims in [94]. It is logical to ask for those who actively research these areas to help
pursue solutions to the problems they uncover, not just complain that others do not follow a not
well-defined set of rules.
The earlier reference to the Thurstonian IRT process is one approach to addressing this issue as
it relates specifically to forced-rank assessment formats. It took some 20 years, but a group of
academics took the time to study and uncover possible solutions to the problems pointed out in
the 1988 paper [94]. The work presented in [30, 31, 32, 122, 123], among many others, presents a
bridge between the CTT/IRT world and that of forced-rank assessments. The good news is that
we see some movement in the academic community to address an issue in that will be to the benefit
of the business community, and it is appreciated. The not as good news is that we are still left to
interpret the output, meaning the use of these models is still subject to the question of whether
the resulting output is good enough.
As an example, one possible output of the work previously mentioned is a 2-dimensional surface plot
of a so-called item information surface. In theory, this should be analogous to the item information
function in the 1-dimensional case. However, there is little even in the 1-dimensional case to provide
guidance on what is a “good” item information function. Again, in theory, we should be able to
interpret something about the overall reliability of scales in relation to each other, yet it is unclear
what is an acceptable level of information.
At TTI Success Insights, we employ both approaches, depending on the intended purpose of the
assessment. As an example, in the case of the Style Insights assessment, we are attempting to
measure observable behavior in a self-report form. The DISC model is the underlying model of the
assessment and has as a basis for behavior the four variables D, I, S, and C. This basis is used to
approximate a lower dimensional model to the true behavior model. While the various behavior
variables are independent of each other, they may, in some cases, be equally attractive to the
assessment respondent. In an attempt to differentiate, TTI Success Insights believes a forced-rank
model is best suited to capture such differences.
In contrast, consider the Emotional Quotient model in which the scales are not as independent.
What we mean by not as independent is that, in some cases, a scale may depend to a certain extent
38
Behaviors Technical Manual Reliability and Validity
on another scale measured by the assessment. For example, it may be the case that one cannot
really be able to self-regulate if one does not first have at least a moderate level of self awareness.
In this case, it is perfectly acceptable to have these scores correlate at a higher level (as compared
to DISC); and, thus, TTI Success Insights believes a Likert-style response format to be the better
choice in this case.
39
Behaviors Technical Manual Reliability and Validity
should not rely on their data and outputs by organizations such as the APA, and yet we apparently
cannot get away from them.
40
Behaviors Technical Manual Item Difficulty
The properties of item discrimination and difficulty play fundamental roles in deter-
mining the reliability and validity of the final scores generated by the assessment.
Without adequate discrimination across the items of the assessment, the assessment
cannot generate scores that are valid and reliable, regardless of the target trait levels
of the individuals in the population. Without an appropriate level of difficulty across
the items of the assessment, the assessment cannot generate scores that are valid and
reliable at particular points of the target trait continuum.
If one defines the psychological trait of interest for a given psychometric assessment to be the
target trait, item discrimination is a measure of how well the response categories of an assessment
differentiate between individuals at different points along the target trait continuum. In contrast,
item difficulty measures the target trait level at which the information provided by the item is at
its maximum. This concept may be better explained through the use of an example.
Consider the information presented in Table 4.1. This table contains a fictitious example comparing
the level of information provided by a set of items on an assessment and their difficulty, or ability to
41
Behaviors Technical Manual Item Difficulty
measure the target trait. Figure 4.1 presents the data from Table 4.1 as a scatter plot of Information
v. Target Trait. The interpretation is as follows. Point A shows that item A provides a moderate
level of information at a low target trait level or low level of difficulty. Contrast this
with point D which shows a similar level of information at a high target trait level. Points B and
C show high and low levels of information at moderate levels of the target trait, respectively. Level
of target trait indicates the position along the target trait continuum at which information about
an individual is provided. The height of the circle indicates the amount of information about the
individual that is provided at a given level of target trait.
B
0.8
Information
0.6
A D
0.4
0.2
C
0
0 0.2 0.4 0.6 0.8 1
Target Trait
4.2 TTI Success Insights Style Insights Forced-rank Item Analysis Results
Based on the previous discussion, we desire scales that cover a broad range of the trait measured by
the scale with higher discrimination values desired over lower ones. In other words, the more dis-
criminating an item is at a given difficulty level, the better that item is at discriminating individual
respondents at that given level of difficulty.
With the previous discussion in mind, all four scales appear to cover broad ranges of item difficulty
at lower to moderate ranges of discrimination. In an ideal world, one may desire higher levels of
discrimination generally speaking. Given that the current analyses is completely diagnostic
42
Behaviors Technical Manual Item Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
in nature, a major focus of the future development of the TTI Success Insights Style Insights
assessment is on increasing the levels of discrimination while maintaining broad coverage along the
trait continuum.
1
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
43
Behaviors Technical Manual Item Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
44
Behaviors Technical Manual Inter-item Correlation
K r̄
αse = . (5.1)
(1 + (K − 1)r̄)
The representation of αse in (5.1) shows that the standardized α coefficient is a scalar multiple
of the average inter-item correlation coefficient. For the reader interested in more on this topic,
consult any of [15],[137],[143],[159],[162], and [178], among many other possibilities.
Of particular note, it is argued in [162] that average inter-item correlation does not reflect the
variability among the inter-item correlations. In agreement with this line of reasoning, it is the
position of TTI Success Insights that a single measure of reliability and/or internal consistency is
not enough information to make fully sound decisions. As discussed further in Section 7, multiple
measures are reported. This is consistent with the findings of [162], although the current analyses
divert from the details of the information suggested in that article.
45
Behaviors Technical Manual Inter-item Correlation
The analysis of this section focuses on two main areas, essentially the mean and standard deviation
of the inter-item correlations. We are interested in the mean falling in the ranges mentioned in
the previous paragraph, while at the same time having as little variation as possible across the
individual inter-item correlations, see [39] for an in depth discussion of the reasoning. In short, a
good mean addresses the internal consistency aspect of reliability. Small standard deviation leads
more into the aspects of validity which is discussed in Sections 6 and 10.
The authors in [39] present a quote of B. F. Green that articulates well the idea we are attempting
to describe. See [79] for the original reference.
Put another way, to ensure uni-dimensionality, almost all of the inter-item correlations
should be moderate in magnitude and should cluster narrowly around the mean value.
B. F. Green articulated this principle most eloquently, stating that the item intercor-
relation matrix should appear as a “calm but insistent sea of small, highly similar
correlations.” [39], pp. 316.
The vast majority of the items fall well within the desired ranges. The small handful of items that
do fall outside acceptable smaller boundaries are still well within the larger range of 0.15-0.50 that
some authors state as reasonable. As is noted elsewhere in this manual, no single analysis is used
to determine the fit of a single item or group of items. Any items falling outside the stated desired
ranges are noted for further analysis in other sections of this manual.
46
Behaviors Technical Manual Inter-item Correlation
Item/Scale D I S C
Item 1 0.26 0.21 0.20 0.19
Item 2 0.24 0.21 0.20 0.18
Item 3 0.25 0.21 0.19 0.19
Item 4 0.25 0.22 0.20 0.19
Item 5 0.26 0.20 0.19 0.18
Item 6 0.25 0.21 0.20 0.19
Item 7 0.25 0.21 0.20 0.19
Item 8 0.25 0.20 0.20 0.18
Item 9 0.25 0.20 0.20 0.19
Item 10 0.25 0.20 0.20 0.19
Item 11 0.25 0.21 0.20 0.19
Item 12 0.25 0.20 0.20 0.18
Item 13 0.25 0.21 0.20 0.19
Item 14 0.26 0.20 0.21 0.19
Item 15 0.25 0.20 0.19 0.18
Item 16 0.24 0.21 0.19 0.19
Item 17 0.25 0.22 0.20 0.18
Item 18 0.25 0.20 0.19 0.19
Item 19 0.26 0.20 0.19 0.18
Item 20 0.26 0.21 0.20 0.19
Item 21 0.24 0.22 0.20 0.19
Item 22 0.25 0.21 0.20 0.18
Item 23 0.25 0.22 0.20 0.19
Item 24 0.25 0.21 0.20 0.19
Average 0.25 0.21 0.20 0.19
It should also be noted that we are looking to keep the departure from the mean to a minimum and
hence, in some cases, while the score itself is in an acceptable range, its departure away from the
true mean may be too large. In these cases, it may be worth considering replacement or potential
removal of an item. This determination should not be made based solely on item correlations but
is made based on a host of evidence as presented in this report.
Definition 5.1. A monotonic function, f : A 7→ B, is a function between the ordered sets A and
B that preserves or reverses the given order.
For the reader unfamiliar with ordered sets and set theory, a (totally) ordered set is a set X such
that for any x, y ∈ X, x ≤ y, y ≤ x, or both, in which case x = y. A concrete example of an
ordered set is the set of all real numbers R. To make things a bit more explicitly clear,
47
Behaviors Technical Manual Inter-item Correlation
As mentioned in the first paragraph of this section, the attenuation paradox states that validity is
not a strictly monotonically increasing function of reliability. In more concrete language, increases
in reliability do not lead to a guaranteed increase in validity. In fact, it is quite possible to find
oneself in a situation where a push to increase reliability actually leads to a decrease in validity.
The following quote is taken from [39], pp. 316.
Is it not desirable, therefore, to retain highly intercorrelated items in the final scale? No,
it is not. This is the essence of the classic attenuation paradox in psychometric theory.
Simply put, the paradox is that increasing the internal consistency of a test beyond
a certain point will not enhance its construct validity and, in fact, may occur at the
expense of validity. One reason for this is that strongly intercorrelated items are highly
redundant: Once one of them is included in the scale, the other(s) contribute virtually
no incremental information. ... Accordingly, a scale will yield far more information -
and, hence, be a more valid measure of a construct - if it contains more differentiated
items that are only moderately intercorrelated.
The moral of the story being told here is that lower levels of inter-item correlation tend to enhance
validity, especially when considering larger constructs and assuming the items in question are
measuring the appropriate construct and the inter-item correlations are at reasonable levels.
48
Behaviors Technical Manual Item Total Correlation
We may then define the item total correlation of the ith item to be the usual correlation computation
between Xi and X:
IT Ci = cor(Xi , X − Xi ). (6.3)
The calculation in (6.3) removes the influence of the ith item and now provides a better estimate
of the relationship between the response to the ith item and the total score (with the ith item
removed). This, in turn, provides a better overall estimate of how well the ith item measures the
construct in question.
49
Behaviors Technical Manual Item Total Correlation
At the time of the writing of this manual, TTI Success Insights had not found a widely endorsed
range of scores for item-total or average item-total correlations. However, some general guidelines
state that scores between 0 and 0.19 indicate an item does not discriminate well. Values between
0.20 and 0.40 show a reasonable level of of discrimination. Finally, values above 0.40 show strong
discrimination properties. Along a similar line of reasoning, [136], pp. 129, presents the information
that is summarized in Table 6.1.
In addition, it is worth noting that item-total correlation scores that are too high, whatever that
may mean, are potentially indicative of items/scales that are overly homogeneous and potentially
deserve further attention. For example, if a subset of items scores well above the average item total
correlation score for a scale, one may want to highlight those items for further analysis in the factor
analysis portion of the study as they may be redundant, more than one of the items providing little
additional information.
In Section 4.1, partial results of the corrected ITC is presented as part of the item analysis results
of that section. Item analysis is a quite encompassing topic given its relative simplicity of imple-
mentation. Several items are worth noting here. In Section 5.1, a brief quote attributed to B.F.
Green is presented. His opinion is that the item correlation matrix should present as a series of
very similar and non-volatile correlations. Put a different way, the correlation between any two
items should not be far way from the mean item correlation value of all items of that scale, if they
are to be adequately measuring the desired construct in a meaningful way.
This concept of staying near the mean value has a natural statistical measure called the standard
deviation or the square root of the variance. It is then the desire for the item correlations, whether
measured as item correlations, average inter-item correlations, or (corrected) item-total correlations,
to have a solid mean with as small a standard deviation as is reasonable in a given situation.
It is worth discussing multiple scales on this analysis. As is seen in Figure 6.1 several scales have
50
Behaviors Technical Manual Item Total Correlation
items at or above the assumed upper bound on the corrected item-total correlation. At the same
time, several items are at or near the lower bound shown in the figure.
Item/Scale D I S C
Item 1 0.24 0.40 0.26 0.30
Item 2 0.62 0.46 0.39 0.59
Item 3 0.56 0.38 0.51 0.35
Item 4 0.51 0.21 0.44 0.26
Item 5 0.36 0.59 0.57 0.55
Item 6 0.49 0.47 0.44 0.43
Item 7 0.50 0.31 0.36 0.44
Item 8 0.41 0.50 0.41 0.53
Item 9 0.54 0.59 0.40 0.33
Item 10 0.46 0.54 0.39 0.25
Item 11 0.56 0.26 0.44 0.29
Item 12 0.43 0.59 0.42 0.44
Item 13 0.46 0.34 0.40 0.42
Item 14 0.31 0.64 0.11 0.40
Item 15 0.50 0.55 0.51 0.47
Item 16 0.59 0.38 0.55 0.27
Item 17 0.55 0.24 0.37 0.51
Item 18 0.55 0.48 0.51 0.40
Item 19 0.32 0.57 0.52 0.50
Item 20 0.27 0.42 0.31 0.38
Item 21 0.57 0.21 0.45 0.29
Item 22 0.53 0.42 0.39 0.50
Item 23 0.52 0.24 0.44 0.40
Item 24 0.45 0.37 0.26 0.27
Average 0.47 0.42 0.41 0.40
Std. Dev. 0.10 0.13 0.11 0.10
The concepts of small variation in item correlations holds here as well, see Section 5.3. In other
words, while the average is at a reasonable level, the variation across the correlation scores for each
scale is a bit more than desired. While low corrected item total correlation is likely indicative of a
questionable item, further analysis is required to determine the root cause. Any items at or near
the lower or upper boundaries are marked for potential further exploration.
average corrected item total correlations in range of providing a moderate amount of information
according to the guidelines from the APA as presented in Table 6.1.
52
Behaviors Technical Manual Internal Consistency
To the extent that the scale is subject to such random errors, it is said that the scale is
not reliable.
In the quote above, random error refers to errors that are considered to be random because people
do not know or choose not to know their source. It is not the case that we are in the setting
of quantum physics where spooky action at a distance guides the theory, at least according to a
paraphrased description once given by Einstein regarding the then new ideas of quantum theory.
Instead, it is acknowledged that random errors exist and a choice is made to measure the error
rather than seek out the source of the error. After all, if the error associated with a given thing is
small enough, we may disregard it.
The APA also makes the following statement, again in [73], chapter 2, page 22:
Validity is more important than reliability to be sure, but reliability is generally con-
sidered to be one of the most important criteria against which measures, and more
importantly, scores that result from them are evaluated.
Finally, we present two quotes that are intended to show the relationship between reliability and
validity from the viewpoint of the APA. The first is regarding what it means for a test to be
completely unreliable, [73], chapter 2, page 23:
... a completely unreliable test can measure nothing because the numerical scores that
are assigned to the examinees are essentially random numbers and will not bear a
consistent relationship to any other attribute among the individuals being assessed.
...[Validity] refers to the degree to which an assessment fulfills its intended purpose and
test results are appropriately interpreted.
Combining the two thoughts, it appears that if an assessment cannot be considered reliable, it
cannot be valid. If the numerical scores based on an assessment do not bear a consistent relationship
to any other attribute among those being assessed, the assessment cannot be fulfilling its intended
purpose. Put another way, an assessment cannot be valid if it is not first reliable. The question
may be posed in reverse. Can an assessment be reliable if it is not first valid? The answer to this
question is yes.
The authors of this manual have used the example of a home oral thermometer in the past. Consider
such a thermometer. If the thermometer consistently measures the same temperature under the
same or similar circumstances, then it is reliable. If the thermometer readings are also correct, it
is then considered valid. A slightly more technical way to state this is reliability is a necessary, but
not sufficient, condition for validity.
53
Behaviors Technical Manual Internal Consistency
Many discussions have occurred over the years as to what is acceptable internal consistency and
how to best report it. Cronbach himself suggests that the α coefficient is not the best measure
of internal consistency and suggests better measures exist, see [57]. The authors mentioned in
[15], [137] ,[143] ,[159], [162], and [178] offer alternatives, among which are reporting α coefficient
with confidence intervals as presented in Section 7.2.1. Additionally, coefficient ω, with confidence
intervals is also presented, see Sections 7.3 and 7.3.1.
Est./Scale D I S C
α 0.890 0.865 0.855 0.850
S.E. 0.001 0.001 0.001 0.001
Lower CI 0.889 0.864 0.854 0.849
Upper CI 0.891 0.866 0.856 0.851
For interpreting reliability coefficients in general, Table 7.2 is presented. An exact source for the
table is difficult to track down. It appears that a version of this was originally published in the
first edition of [59]. However, on page 184 of [10], the author states:
How high should values of α (or related indices such as KR-20) be? I hesitate to give
an answer to this question because estimates of α can be affected by many things, such
as the purpose of the test, the heterogeneity of the sample from which it was obtained,
the conditions of testing, and the number of items.
Range Interpretation
α ≥ 0.90 Consider Shortening the Scale
0.80 ≤ α < 0.90 Very Good
0.70 ≤ α < 0.80 Respectable
0.65 ≤ α < 0.70 Undesirable
α < 0.65 Unacceptable
The author in [10] goes on to present several examples of the differing opinions on what constitutes
acceptable levels of evidence of reliability based on the α coefficient or similar measure. For example,
[131] states that 0.70 is acceptable during development, 0.80 for “basic research scales”. and 0.90
for clinical settings are appropriate minimums. [142] recommends a minimum of 0.80 without the
55
Behaviors Technical Manual Internal Consistency
same restrictions as [131]. As a final example, [97] suggests values of 0.95 for clinical settings. [10]
concludes her remarks on recommended levels of reliability with:
The higher values for clinical decisions reflect the serious consequences of such decisions
and the need for correspondingly higher standards. The same argument for higher
reliabilities could be made in the context of educational tests that are used for student
placement, graduation, or remediation decisions or for licensure or certification tests.
Indeed, any test that will be used to make consequential decisions about students,
clients, patients, or others, should contain as little error as possible.
It is beyond the scope of this document to argue the merits of Table 7.2 or the merits of any
particular bounds that have been discussed over the years. We present Table 7.2 along with
the comment that some authors suggest any reliability coefficient with value exceeding 0.70 to
be acceptable while others suggest much more conservative estimates. The author in [59] states
unequivocally that these bounds are his preference and that “I cannot defend them on strictly
rational grounds.”
As discussed in Section 3, the lack of consensus and, at times, guidance from the academic com-
munity on subjects such as validity and reliability of psychometric assessments places a potential
barrier in front of practitioners. Organizations such as the APA or National Council on Measure-
ment in Education, the Mental Measurement Yearbook, etc., have great influence in these areas.
Yet, even in [74], there is little to no guidance as to what constitutes an acceptable value of an α
coefficient. It is then left to the judgment of the developer, using the available tools, to determine
whether a scale shows an acceptable level of evidence of internal consistency reliability.
Combining the information in Table 7.2 with the results presented in Table 7.1, it is seen that the
TTI SI EQ assessment scales all show solid evidence of internal consistency as measured by the α
coefficient. Additionally, the standard errors (S.E. in Table 7.1) are all relatively small with respect
to the magnitude of the coefficients leading to relatively tight confidence intervals. One is then led
to conclude that, based on the α coefficient analysis, that all 5 scales of the TTI Success Insights
Style Insights assessment score in the upper end of the Respectable range to the Very Good range.
Est./Scale D I S C
ω 0.898 0.866 0.866 0.855
S.E. 0.005 0.006 0.006 0.006
Lower CI 0.887 0.853 0.855 0.843
Upper CI 0.905 0.878 0.878 0.867
Combining the information in Table 7.2 with the results presented in Table 7.3, it is seen that the
TTI Success Insights Style Insights assessment scales all show solid evidence of internal consistency
as measured by the ω coefficient. The scores range from a low of 0.855 for Compliance to a high of
0.898 for Dominance. The standard errors (S.E. in Table 7.3) are all relatively small with respect
to the magnitude of the coefficients leading to relatively tight confidence intervals. It is again the
case that the scales of the TTI Success Insights Style Insights assessment show reasonable evidence
of internal consistency based on the ω coefficient analysis.
57
Behaviors Technical Manual Internal Consistency
evidence of validity seems confusing. On the other hand, computing the α coefficient to measure
evidence of internal consistency is straight forward. Even if we move to the more complex mea-
sures of internal consistency such as ω, black-box software packages exist for the average person
to pass some data through and get a result. Tables such as that shown Table 7.2 exist to lend
immediate interpretation. It is tempting, therefore, to put the best foot forward in an area that is
easy to understand for the assessment developer as well as the end user of any product based on
the underlying assessment.
Then, as discussed in Section 5.3 and [112], the Attenuation Paradox comes into play. Unfortu-
nately, this is a counterintuitive concept, abstract and even esoteric to some, that tells the assess-
ment developer that it may actually be necessary to lower measures of internal consistency in order
to obtain acceptable levels of evidence of validity. The concept of lowering internal consistency
measures to improve evidence of validity measures is more related to α than ω based on the fact
that α is more directly related to inter-item correlations.
For example, one may improve internal consistency measures by increasing the inter-item correla-
tions or by increasing the number of items assessed. In either case, the realization is to increase
the homogeneity of the assessment, thereby decreasing the likelihood that the construct domain
is any more adequately covered at the higher level of measure of internal consistency. The end
result, therefore, is that increasing homogeneity (internal consistency) decreases the coverage of
the content domain (validity).
It is not that the conclusion is that it is necessary to decrease internal consistency to obtain
validity. Rather, it is the conclusion that to adequately cover a content domain it may be necessary
to construct a less homogeneous assessment. The end result is that while internal consistency
measures may decrease, validity measures increase. Reliability is an important measure, but validity
is more so. If necessary, sacrificing measures of reliability to obtain validity develops a better overall
assessment, at least according to the currently accepted standards of the APA and authors such as
[39, 112], among many others.
59
Behaviors Technical Manual Temporal Consistency
One should note that stability of scores is defined as persistent relative standing within
the group rather than consistent absolute value of score.
We interpret this statement to mean that it is to be expected that some differences may exist in
scores on multiple applications of the same assessment over time. What is truly important is that
the relative standing among the individuals is generally expected to remain consistent across time.
At the end of the day, this is a lot of work to say that the correct tool to use is the standard Pearson
correlation coefficient to measure the differences between individuals’ scores from time 1 to time 2.
A major concern of the test-retest methodology is whether the quantity being measured is the
reliability of the assessment or a measure of the memorization skills of the assessment respondents.
The author of [73] states that there are (at least) four factors that memory may exert influence on
and impact measurement of evidence of the reliability of an assessment. These are:
We address each of these topics one at a time. For the length of time between administrations of an
assessment, it is clear that if the time between assessments is too short, there is a higher likelihood
of memorization of items. This would then potentially inflate estimates of reliability. The length of
time should be long enough that memorization becomes unlikely while not so long as to have major
life changes occur that may differentially impact a respondent’s scores with respect to behavior.
This assessment is measuring behavior styles. While it is possible for the behavior an individual
to change over time, we do not expect changes over relatively short time periods. Waiting for an
exceedingly long time between administrations may cause an underestimate of the reliability of the
assessment.
The only guidance the APA provides on the subject is as follows, see [73], page 29:
60
Behaviors Technical Manual Temporal Consistency
... the time period between the test and the retest should correspond to the length of
time between typical testings and the subsequent behavior that the test attempts to
predict.
It is not clear that the TTI Success Insights Style Insights assessment is predicting behavior in the
sense mentioned in the above quote. What should be clear is that the main concern here is the
length of time between assessments be such that memorization is unlikely. Further one may be able
to implement procedures intended to identify potential abuses of the assessment.
The current length of the TTI Success Insights Style Insights assessment is 24 frames. These
frames are split along four scales and the respondent is asked to rate each statement in relation
to the remaining statements in terms of how much the statement is a reflection of the individual’s
immediate reaction to the statements. It is the opinion of TTI Success Insights that the combination
of an adequate elapsed time between administrations of the assessment and its current length and
makeup are sufficient to address the majority of rote memorization techniques.
The 24 frames on the TTI Success Insights assessment are currently presented as a four-item
forced-rank frame. We again note that the length of time between and the length of the assessment
discourage memorization. We do note that while it is possible to memorize the constructs and their
corresponding statements, we feel it is unlikely that a large enough proportion of any test-retest
population has the desire to do so.
Finally, the nature of the characteristics being measured is addressed as follows. This assessment is
measuring behavior. It is possible that memorization becomes a problem, but TTI Success Insights
feels this is unlikely. Given that this assessment is measuring behavior, it is more likely that the
state of emotions of the individual at the time of administration may directly impact the reliability
of the assessment and we address such issues in Section ??. Memorization issues are unlikely to be
a major problem for TTI Success Insights Style Insights assessment.
A second area of concern related to temporal consistency is called the differential practice effect.
Differential practice effect is the concept that performance, generally speaking, improves with
practice.
According to the APA, see [73], page 29:
As mentioned in the note above, if we have an adequate response to the time issue, this is likely
more than enough to address the differential practice effect. The fact that TTI Success Insights
assessments (in particular Style Insights) are not measuring performance implies that as long as the
respondents are not overly familiar with the assessment itself, we should be adequately addressing
any issues related to differential practice effects.
demographic information available at the time of publication of this manual is presented in Table
8.1. As is seen, there is approximately a 50/50 split on the male/female breakdown in the population
of slightly more than 33,000. The mean time between administrations of the test is slightly more
than 16 months, with the median time between administrations just more than 14 months.
1/1/2015 to 3/25/2020
Elapsed Time 3-38 Months
Mean Elapsed Time 486 Days
Median Elapsed Time 420 Days
Population 33,315
Males 16,825 (50%)
Females 16,440 (50%)
In many, although not all, cases the TTI Success Insights Style Insights assessment is utilized as
part of a training and/or coaching program. As such, there is an expectation of “improvement” in
scoring that will potentially degrade the overall test-retest correlation results, especially relevant
to smaller data samples. By improvement it is meant that during a coaching or training process,
aspects of individuals’ behavior, motivation, or other constructs being measured may be identified
as areas on which to work for a given reason, and improvement is meant to reflect a movement
towards a specific goal as measured by the variables underlying the Style Insights assessment. As
noted in an earlier section, specifically quoting [73], measures of temporal reliability are measures
of “persistent relative standing within the group rather than consistent absolute value of score”.
Scale D I S C
D 0.80
I 0.84
S 0.79
C 0.82
As such, it is to be expected that there may be some negative impact on the relative standing of
individuals when comparing those being coached and trained in improving the specifics of individual
as compared to those who may not be participating in such a program. At the time of writing this
manual, there is no distinguishing database element to determine whether an individual participated
in a coaching or training program, or whether an individual took the assessment for other purposes.
Table 8.2 contains the correlation results between the first and second administrations of the TTI
Success Insights Style Insights Insights assessment in the data from January 1, 2015 to March 25,
2020. The results are solidly in the 0.79 to 0.84 range with all correlations computed with statistical
significance at less than a 0.001 level. Based on the results of the correlation analysis, we conclude
that the TTI Success Insights Style Insights temporal consistency is clearly at an acceptable level.
62
Behaviors Technical Manual Temporal Consistency
63
Behaviors Technical Manual Generalizability
case the items, has the following formulation of an individual’s score on an item:
EM Sp 2 + n σ2
= σpi,e i p
EM Si 2 + n σ2
= σpi,e p i (9.5)
2
EM Spi,e = σpi,e
Note that the left hand side of (9.5) is completely known from the ANOVA calculations and the
right hand side is a triangular system that may be easily solved by substitution methods. Of course,
for systems with many facets it is much easier to use the power of modern computing systems using
matrix algebra. The elements σp2 , σi2 , and σpi,e
2 represent the unknown variances we are interested
in computing, and the values np and ni represent the number of individual respondents and the
number of items on the assessment respectively.
Now that we have estimates of the variances for the relevant pieces of the puzzle, we are interested
in what is called the δ-type error in the literature. The δ-type error is of primary concern when
researchers are interested in decisions that rank order individuals, such as the test-retest data from
Section 8. All sources of variation that involve individuals are removed from consideration and we
compute the following:
2
σpi,e
σδ2 = . (9.6)
ni
65
Behaviors Technical Manual Generalizability
σδ2 may be interpreted as the variance of the errors for relative decisions. We are now in a position
to define the generalizability coefficient in the sense that we may compute the ratio of the universal
score variance to the expected observed score variance, see [177].
σp2
Eρ2 = . (9.7)
σp2 + σδ2
(9.7) is analogous to the usual standard reliability coefficients such as α and ω as presented in
Section 7. In contrast to the relative error is the absolute error which is concerned when both rank
ordering and differences in averages may be relevant. In the one facet case, we are interested in
both the residual effect and the effect due to items. Hence, we compute the following:
2
2 σi2 σpi,e
σ∆ = + . (9.8)
ni ni
A final piece comes into play through the index of dependability, defined as
σp2
Φ= 2 . (9.9)
σp2 + σ∆
At this point, it is convenient to present a hypothetical example to help solidify the ideas presented
so far in this section. This example and data in Table 9.1 are pulled from [119]. The details of the
computations of the values for EM Sp , EM Si , and EM Spi,e are lengthy and are based on standard
ANOVA calculations. The data in Table 9.1 is a hypothetical example of five individuals with
scores on five distinct items.
Table 9.1: Hypothetical Example of 5 Individuals and a Single Facet
Items
Person 1 2 3 4 5
1 9 6 6 8 8
2 8 5 6 6 6
3 9 8 7 8 8
4 7 5 6 4 4
5 7 5 6 3 4
Based on the data provided in Table 9.1, the following values are computed.
EM Spi,e = 0.84
EM Si = 4.34 (9.10)
EM Sp = 8.74
The results in (9.10), substituted into (9.5), lead to the following values for the variances of interest.
2
σpi,e = 0.84
2 EM Si −EM Spi,e 4.34−0.84
σi = np = 5 = 0.70 (9.11)
EM Sp −EM Spi,e 8.74−0.84
σp2 = ni = 5 = 1.58
Continuing, we compute the σδ2 and σ∆
2 values, using (9.6) and (9.8), respectively.
66
Behaviors Technical Manual Generalizability
2
σpi,e 0.84
σδ2 = ni = 5 = 0.17
σi2 +σpi,e
2 (9.12)
2 0.70 0.84
σ∆ = ni = 5 + 5 = 0.31
Finally, we may compute the generalizability coefficient and dependability index using (9.7) and
(9.9), respectively.
σp2 1.58
Eρ2δ = σp +σδ2
2 = 1.58+0.17 = 0.90
σp2 (9.13)
1.58
Φ = σp2 +σ∆2 = 1.58+0.31 = 0.31
We now wish to extend these ideas to a situation in which we have more than one facet. The details
of an n-facet model are beyond the scope of this manual and the interested reader is referred to
any of [28], [58], or [177]. We present, again by example, a two-facet model and briefly discuss the
ideas behind extending the two-facet model to a three-facet model.
The following example is presented in [177] with a shortened, similar example provided in [119].
Consider a group of 20 student teachers preparing lesson plans that are to be graded by two different
raters. There are a total of four exercises for each of the 20 individuals and two raters to provide
scoring for each of the four exercises for each of the 20 individuals. A schematic style view of the
main idea behind the problem at hand is presented in Table 9.2.
.. .. .. .. .. .. .. .. ..
. . . . . . . . .
In this case we now have two facets, namely the raters and the tasks with which each of the
individuals must interact. This implies that we must consider sources of variation in addition
to the individual and the item. We must extend the ideas of equation (9.4) to account for the
additional complexity. Equation (9.14) shows the “easy” formula as presented in [177], as well as
in the original source [58].
It is noted that the values µ, µα , and µαβ are all theoretically computed using an expected value.
For example, µp is computed as the expectation of Xptr with respect to t and r, µp = Et Er Xptr .
Similarly, µpt = Er Xptr . The remainder of the values are computed in a similar manner replacing
the indices in the fashion just described.
We now find ourselves in the situation of having seven separate mean values that represent potential
effects that are introduced during the entire assessment process. In this case, the assessment process
includes the scoring as generated by the raters, not just the taking of the assessment (performing
the tasks).
67
Behaviors Technical Manual Generalizability
EM Sp = σ 2 (ptr, e) + nr σpt
2 + n σ2 + n n σ2
t pr t r p
EM St = σ 2 (ptr, e) + np σtr
2 + n σ2 + n n σ2
r pt p r t
EM Sr = σ 2 (ptr, e) + np σtr
2 + n σ2 + n n σ2
t pr p t r
EM Spt = σ 2 (ptr, e) + nr σpt
2 (9.15)
EM Spr = σ 2 (ptr, e) + nt σpr
2
EM Sptr,e = 2
σ (ptr, e)
It is noted that (9.15) again defines an almost triangular system that is easily solved by using a
simple substitution method with the values on the left hand side of the equation known a priori
from the ANOVA procedure. We provide the analogous computations for the values of σδ2 and σ∆ 2
next.
2
σpt 2 2
σptr,e
σpr
σδ2 = nt + nr + nt nr
(9.16)
2 2 2
2 σt2 σr2 σpt σpr 2
σtr σptr,e
σ∆ = nt + nr + nt + nr + nt nr + nt nr
Finally, the generalizability coefficient and dependability index are defined in the same way as
previously, see (9.7) and (9.9), reproduced for convenience here.
σp2
Eρ2 = σp2 +σδ2
(9.17)
σp2
Φ = σp2 +σ∆2
The extension to higher facet models should be fairly obvious at this stage, although the details may
seem a bit overwhelming. However, if one notes the symmetry that is developing when considering
the defining equations for the one- and two-facet models, this complexity abates somewhat. As an
example, consider a three-facet model of the form (i×j ×k ×p), to use the notation of [58], and note
that we have made the simplifying assumption that our needs are fulfilled by a model of this type
68
Behaviors Technical Manual Generalizability
rather than the multitude of other possible three-facet models. Then consider the representative
EM S-equations for this model as presented in (9.18).
EM Sp = σ 2 (pijk, e) + ni σpjk
2 + n σ2 + n σ2 + n n σ2
j pik k pij i j pk
2
+ni nk σpj + nj nk σpi2 + n n n σ2
i j k p
EM Sptr,e = σ 2 (ptr, e)
The remainder of the coefficients may be filled in using symmetry and other properties that are
apparently related to a particular set of coefficients related to the binomial theorem, a statement
made without an offer of proof at this time. As an example, it is hypothesized that there are a
total of 15 defining equations for the three-facet model which is the sum of the coefficients of the
binomial (a + b)4 minus one. This equation holds for one and two facet models as well, but we only
assume the forms (i × j), (i × j × k), and (i × j × k × p). The authors make no claims on other forms
of the facet models that are also included in generalizability theory. It is likely that under certain
assumptions one may be able to formulate and prove a proposition via mathematical induction on
the number of facets.
As a final presentation of this portion of the section on G-theory, we formulate the test-retest
problem as a two-facet model. The schematic of such a formulation is presented in Table 9.3.
.. .. .. .. .. .. .. .. ..
. . . . . . . . .
The table shows a generic scale of m items. As is discussed in Section 2, the TTI Success Insights
Style Insights assessment has four scales, each of which has 24 items. The index n represents the
number of respondents in the data set under consideration and the T ime 1 and T ime 2 labels for
the final facet represent the two separate administrations of the assessment.
9.2 One-Facet G-theory Model Applied to TTI Success Insights Style Insights
Test-Retest Data
This section extends the ideas of the usual test-retest correlation study to a one facet model using
generalizability theory. The idea here is to treat the “time” variable as a source of variance, in
addition to the variance present in the individual respondents. Given that the respondents’ scores
69
Behaviors Technical Manual Generalizability
are the item of interest, they are not included as a facet, hence the use of a one facet model, the
only facet being two separate administrations of the assessment.
D Scores
Person Time 1 Time 2
1 31 38
2 32 35
3 31 38
4 37 30
5 37 33
.. .. ..
. . .
n 41 41
Table 9.4 shows how the data for a test-retest sample may be formulated into the one-facet modeling
approach, see for example Table 9.1 in Section 9.1. The way to interpret the data in Table 9.4 is
that each row represents an individual respondent and each column the respondent’s score at Time
1 or Time 2. As an example, the final row, row n, shows that the last respondent listed in the
data set scores 41 on the D scale on both administrations of the assessment. We then compute the
variance table as shown in Table 9.5.
The first thing of note in Table 9.5 is the very small, positive component variance for the Time
facet, especially when compared to the Persons source of variability. This low component variance,
accounting for well under 1% of the total variance in this model, indicates that Time is not a major
source of variability for the Dominance scale. In this setting we are interested in determining
precisely whether Time is a major source of variation in the scoring of the scales of the TTI Success
Insights Style Insights assessment. It does not appear that this is the case for the Dominance scale.
The second main piece of information to take away from this analysis is that nearly 80% of the
variability in the model under consideration is accounted for by the Persons source of variation.
70
Behaviors Technical Manual Generalizability
This does imply that approximately 20% of the variation is unaccounted for. The combination
of the preceding information may be interpreted to show that more than 50% of the variation is
accounted for by Persons, yet a portion of the variation is not accounted for and further analysis
may be warranted.
Table 9.6 shows the results for the Influence scale. In this case we see that in excess of 83% of
the component variance is explained by the Persons component. Influence does show the largest
component variance due to time, yet this scale still only accounts for less than 1% of the total
component variance. This appears to be quite solid performance with less than 17% of the variance
not accounted for, as well as the majority of the variance accounted for attributed to the Persons
facet.
Table 9.6: Style Insights Influence Scale Test-Retest
One-Facet G-Theory Model 1/1/2015 to 3/25/2020
The Steadiness scale is presented in Table 9.7, again showing quite similar results to the Dominance
scale. In excess of 78% of the component variance is explained by the Persons component and a
small component variance for the Time facet. The Steadiness scale shows a similar percentage of
unaccounted for variance with an even lower percentage of variance attributed to the Time facet.
The similarities continue in the Compliance data shown in Table 9.8. We again see a small positive
variance for the time facet with more than 82% of the component variance accounted for in the
Persons component. The percentage of variance attributed to the Time facet for the Compliance
scale one of the lowest of the five scales at only 0.00%. This scale is once again in the same situation
as both the Influence scale with more than 80% of the variance attributed to the Persons facet with
approximately 26% unaccounted for.
component rather than the Time component, although there is a non-trivial amount of compo-
nent variance unaccounted for. It is worth noting once again that unaccounted for variance has
traditionally been labelled as being due to random effects, yet this is not the interpretation when
considering the implications of Generalizability theory. In fact, the unaccounted for variance is due
to effects that have not been identified for one reason or another, not random effects that cannot
be understood.
Finally, we present the generalizability and dependability coefficients developed in Section 9.1,
see (9.7) and (9.9). Note that from the two previously mentioned equations, the relationship to
computing the reliability and dependability coefficients is such that the main component of the
computation is the value of σp2 compared to the δ and ∆ errors presented in (9.6) and (9.8). The
results are such that the generalizability and dependability coefficients are very close to being the
same.
Table 9.9: Style Insights One-Facet Test-Retest Generalizability and Dependability
1/1/2015 to 3/25/2020
It is noted that the Generalizability coefficient may be thought of as relatively synonymous to the
standard reliability coefficient. It is not completely clear that there is a 1-1 relationship between
the scoring levels as presented in Table 7.2. However, it is safe to interpret the coefficients in Table
9.9 in a similar fashion. With all scores at or above 0.84, all scales on the TTI Success Insights
Style Insights assessment score quite well under the G-theory one facet model approach with Time
as the facet.
As a final note, with all generalizability and dependability coefficients at or above 0.84 for all scales,
the post-2015 version of the TTI Success Insights Style Insights assessment scales perform quite
well under the one facet G-theory modeling approach with Time as the facet under consideration.
results show that for all scales, time is not a factor in differences in test scores. The majority of
the residual variances align with Persons, which is as desired.
73
Behaviors Technical Manual EFA
On this topic of parsimony, [99] states what they call the Postulate of Parsimony as follows:
This stipulates that, given two or more equally compatible models for the given data,
the simpler model is believed to be true; in factor analysis, only the model involving
the minimum number of common factors is considered appropriate.
The authors would like to point to the similarities between this principal of parsimony and the
so-called Occam’s, razor which states that of two explanations that account for all the facts, the
simpler one is more likely to be the correct one. Thurstone proposed a Simple Structure Theory
that is discussed in [99] as simply Simple Structure that they define as
A special term referring to a factor structure with certain simple properties; some of
these properties include that a variable has factor loadings on as few common factors as
74
Behaviors Technical Manual EFA
possible, and that each common factor has significant loadings on some variables and
no loadings on others.
A presentation of Thurstone’s original ideas for simple structure are presented in [84] on pp. 97-98,
where the author provides the following quotes from Thurstone’s original work on the topic, see
[169] for the original citation.
If a reference frame can be found such that each test vector is contained in one or more
of the ... coordinate hyperplanes, then the combined configuration is called a simple
structure. (Thurstone 1947, pp. 328)
To provide more context for the above quote, the following is also presented on pp. 98 of [84].
Just as we take it for granted that the individual differences in visual acuity are not
involved in pitch discrimination, so we assume that in intellectual tasks some mental
or cortical functions are not involved in every task. This is the principle of “simple
structure” or “simple configuration” in the underlying order of a given set of attributes.
(Thurstone, 1947, pp. 58)
The authors of this manual understand that the reader unfamiliar with factor analysis may not
completely grasp the concepts being presented. Terms such as simple structure and postulate of
parsimony are discussed further in subsequent sections when certain matrices integral to the factor
analytic approach are presented. For now, we present a final topic related to historical development
of factor analysis that may be viewed as controversial but is highly important nonetheless.
The concept of factor score is also discussed in a subsequent section. For now, we define a factor
score simply to be a score derived from the results of a factor analysis based on a given set of
underlying data. This factor score is, in some sense, a solution to a (constrained) optimization
problem. The exact meaning of optimal is problem dependent and determined by one’s choice of
what one wishes to minimize. For now, we briefly discuss the concept of factor indeterminacy. In
a later section, once some notation is available to us, we present the topic in more depth.
Underlying the factor analysis problem is, generally, a linear system of equations in some form. The
problem of factor indeterminacy is related to the fact that the total number of factors (common and
unique) exceeds the number of items. This in turn means that a certain matrix in the definition of
the problem cannot be square, and hence cannot be inverted. The interpretation is that we have a
system of equations in which we have more unknowns than we have equations. This implies that
if a solution exists, then infinitely many solutions exist. The existence of an infinite number of
solutions to a problem does not imply one cannot work with the system. It does, however, mean
that one must establish further criteria for the existence of a specific solution. It also means that
there are varying levels of consistency among the solutions. This may be interpreted in the sense
that some solutions may have more determinacy than others, and the search is on for the best
solution, whatever that may mean.
The next section lays out the general factor analysis problem and discusses some of the issues
related to the problem. The section generally follows the first few chapters of [84], although other
authors are consulted.
3.3. There appears to be a tendency for individuals in both the assessment industry and in the
academy to over-generalize based on quite low levels of correlation. As discussed previously, one
way to think of correlation is as the linear variation accounted for between the two variables. In
this way, if we square the correlation coefficient we have an estimate of the variation accounted for
between two variables. Having said that, note that it takes a correlation of at least ρ = 0.7071068
in order to claim that 50% of the variation between two variables is accounted for.
Before discussing the geometric interpretation of correlation, we must first introduce some notions
of real analysis from mathematics. First, let a = (a1 , . . . , an ) denote an n-dimensional vector. We
usually write a ∈ Rn . The (usual notion of) length of the vector a is given by the Euclidean norm
which is defined to be
v
u n
uX
||a|| = t a2i . (10.1)
i=1
We may also talk about the scalar or dot product of two vectors. Let a, b ∈ Rn and define the
scalar product to be a · b where
n
X
a·b= ai bi . (10.2)
i=1
Note that when no confusion may arise, we write ab for a · b. We need to establish a relationship
between the vectors a, b ∈ Rn and the angle θ between them as is hinted at in Figure 10.1. If we
imagine the triangle that is formed by the three vectors a, b, and b − a we may write, using the
so-called Law of Cosines from elementary trigonometry, a relationship that will allow us to prove
the following.
y
a
b−a
θ
x
Proposition 10.1. For any two vectors a, b ∈ Rn , the angle θ between them is defined by the
relationship
a·b
cos(θ) = . (10.3)
||a||||b||
Proof. As alluded to above, we shall employ a very unused relationship from elementary trigonom-
etry thereby answering the age-old question of when will I ever use this in real life. So, here you
76
Behaviors Technical Manual EFA
go. The Law of Cosines states that the following relationship holds for the triangle presented in
Figure 10.1,
a·b
cos(θ) = , (10.7)
||a||||b||
the result claimed in (10.3) in the statement of the proposition.
We now turn our attention to some notions of statistics, such as the familiar mean and standard
deviation of random variables. Let X be a random variable and consider a random sample of X
of size n. If we denote by pi the probability of a random selection of item Xi , we may define the
expectation of the random variable X to be
n
X
µ(X) = E[X] = Xi pi . (10.8)
i=1
Now, the reader may not be familiar with this particular form of the expectation of a random
variable. In an elementary mathematics or statistics course, we usually assume in this setting that
the random variables are, in some sense, uniformly distributed and we then have pi = n1 for all i.
We then obtain the usual definition of mean or average as
n
1X
µ(X) = E[X] = Xi , (10.9)
n
i=1
The other main statistic that most should be familiar with is a measure of the scatter or variation
in a random sample X. We define the variance of X to be
n
X
σ 2 (X) = E[(X − µ)2 ] = (Xi − µ)2 pi . (10.11)
i=1
77
Behaviors Technical Manual EFA
n
1X
s2 = (Xi − µ)2 . (10.12)
n
i=1
1
Another possibility is to define pi = n−1 and we have the so-called unbiased estimator
n
1 X
s2 = (Xi − µ)2 . (10.13)
n−1
i=1
Adopting the usual convention that “pretty much everything may be considered normally dis-
tributed”, we may consider the standardized version of a random variable X defined as
X −µ
,x= (10.14)
σ
and we may note that µ(x) = 0 and σ(x) = 1. We state this in the following proposition.
X−µ
Proposition 10.2. Let X be a random variable with mean µ and variance σ 2 . Let Z = σ .
Then, we have
µ(Z) = 0
and (10.15)
σ(Z) = 1.
Proof. We use the fact that the expectation operator is linear, i.e., that E[αX + β] = αE[X] +
E[β] = αE[X] + β, where α, β are scalars. So,
E[Z] = E[ X−µ
σ ]
µ
= E[β] = σ1 E[X] − σ
(10.16)
µ µ
= σ − σ
= 0.
For the standard deviation of Z, it is convenient to first show that σ 2 (X) from (10.11) may be
written in an alternative form.
= E[X 2 ] − E[X]2 .
Now, using the final line of (10.17) and the results of (10.16) we have
X −µ 2
= E[( ) ] − 02
σ
78
Behaviors Technical Manual EFA
1
= σ2
E[X 2 − 2Xµ + E[X]2 ]
1
= σ2
(E[X 2 ] − 2E[X]µ + E[X]2 )
1
= σ2
({E[X 2 ] − E[X]2 } + {E[X]2 − 2E[X]µ + E[X]2 })
(10.18)
1
= σ2
(σ 2 + µ2 − 2µ2 + µ2 )
1
= σ2
∗ σ2
= 1.
A brief note on an earlier comment, i.e., that everything is essentially considered normally dis-
tributed. This is actually not the case. In fact, most distributions are not normal. However, we, as
a society, have a tendency to view things in “the average”. Many incorrect assumptions are made
about data and its usage, including incorrect interpretations of the Central Limit Theorem and the
Law of Large Numbers. The Central Limit Theorem says, essentially, that if we have a random
sample of size n taken from a population with mean µ and standard deviation σ, the limiting form
√
of Z = X̄−µ
σ n is the standard normal distribution. This is a statement about the mean value of
the random variable X, not about the distribution of X. The Law of Large Numbers simply says
that the more times we perform the same experiment, the closer to the actual expected value we
get. The idea here is that we do not have that all things are normally distributed. We do have
that methods that work for normal distributions may work for non-normally distributed data under
appropriate conditions.
Finally, we may look at how one data set or vector varies in relation to another. We have a natural
measure between two separate data sets analogous to the variance of a random variable, and it is
called covariation or covariance. We define
n
1X
σ(X, Y ) = E[(X − µX )(Y − µY )] = (X − µX )(Y − µY ). (10.19)
n
i=1
We have used the notation µZ to denote the mean of the random variable Z and m may be either n
or n − 1 based on whether one desires to use a biased or an unbiased estimate of σ 2 (Z). A natural
extension of (10.19) is to normalize it via the following approach
x·y
ρX,Y = = cos(θ), (10.23)
||x||||y||
where θ is the angle between x and y in Rn .
We finish the discussion of correlation/covariance and its geometric interpretation by bringing
into the mix the idea of statistical significance of a correlation computation and an application of
covariation to the usual linear regression approach to modeling often used in social science settings.
First, note that we often compute the statistical significance of a computation by using the so-called
Student’s test or Student’s t-test. We do so here again. In this case, we compute the t-score as
√
ρ N −2
t= p . (10.24)
1 − ρ2
The null hypothesis in this case states that the correlation coefficient is equal to zero. If the
hypothesis is false, then we may interpret this to mean that obtaining a correlation coefficient
greater than ρ is highly unlikely. We thus compute
Zt
p=2 f (x)dx, (10.25)
0
where t is the value computed in (10.24) and f (x) is the probability density function of the t-
distribution. For completeness,
− ν+1
Γ ν+1
2 x2 2
f (x) = √ ν
1+ , (10.26)
νπΓ 2 ν
where ν represents the degrees of freedom and Γ is the gamma-function
n
(Ŷi − Ȳ )2
P
i=1
R2 = n , (10.29)
(Yi − Ȳ )2
P
i=1
Variable Definition
X1 whether government should spend more money on schools
X2 whether government should spend more money to reduce unemployment
X3 whether government should control big business
X4 whether government should expedite desegregation through housing
X5 whether government sees to it that minorities get their respective quotas in jobs
X6 whether government should expand the headstart program
The authors do state that, in their opinions, the first factor F1 is responsible for the covariation in
the first three variables and deals with economic issues. The second factor F2 is responsible for the
covariation in the last three variables and is related to civil rights issues.
The idea is to follow the lead of that discussed in Section 10.2 and create standardized variables
Zj rather than considering the set of Xj . The reason for this is pointed out in [84] pp. 12 where he
discusses the fact that the variables Xj are measured in arbitrary units with an arbitrary origin.
Standardizing the variables then puts them all on the same scale with origin (mean) at zero with a
81
Behaviors Technical Manual EFA
unit variance, see Proposition 10.2. Once the variables are standardized, we may then form what
[84] calls the Full Classical Factor Model,
m
X
zji = ajk Fki + uj Yji . (10.31)
k=1
In (10.31) the term Fki is the k th common factor for the ith individual, ajk is the k th factor loading
for the j th variable, Yj is the unique factor for the j th variable and uj is its unique factor loading.
The combination ajk Fki for each of the m terms represents the contribution of the corresponding
factor to the linear combination, while uj Yj represents the residual error.
There are some traditional assumptions that need to be made clear. First, we assume that Fki
and Yj are standardized variables, i.e., each has zero mean and unit variance. Further, the unique
factors are uncorrelated with the common factors, usually denoted as
We interpret the terms in (10.34) as follows. The first line of the equation is the definition which is
expanded in terms of the right hand side of (10.31). The second line contains two sums of sums. The
parenthesized sums in each of the two outer sums are the definitions of the variance of the common
and unique factors, respectively. Recall that both common and unique factors are standardize
variables. The third line of (10.34) also contains two sums of sums. The internal parenthesized
sums represent the covariance of the common factor with each other and the common factors with
the unique factors. The latter is assumed to be zero. If we assume the former to also be zero, we
have the so-called orthogonal factor case and we may write (10.34) as
m
X
s2j = a2ji + u2j . (10.35)
i=1
In (10.35), aji is the contribution of Fi to the variable zj . We then define the total contribution of
the factor Fi to all variables to be
n
X
Vi = a2ji , (10.36)
j=1
82
Behaviors Technical Manual EFA
m
X
V = Vi (10.37)
i=1
to be the total contribution of all common factors to the total variance. Then, V /n is the ratio of
the total variance explained. Clearly, one may consider Vi /n to be the contribution to the total
variance explained of the ith common factor. The term
m
X
h2j = a2ji j = 1, . . . , n (10.38)
i=1
In (10.40), Sj and Ej are the specific and error factors respectively. We may then write uniqueness
as u2j = b2j + e2j . Additionally, we may write the total variance as sj = h2j + b2j + e2j .
The authors hope that at this point the reader has gained a feel for the approach taken to the basic
factor analysis problem. It is not the intention of this section to completely present all aspects of
factor analysis. Rather, we present a rough overview of the approach along with some tidbits of
information that may be found to be useful. As an example of such useful information may be
found in Table 10.2 where we have reproduced a similar table from [84]. The table synopsizes the
information presented throughout this section of the manual. Prior to presenting this table, one
more short definition and discussion is required. This presentation closely follows the discussion on
pp. 19 in [84]. In this discussion, the term reliability is defined as the complement of error variance.
In psychology, this systematic component of the variable (as distinguished from error
components) is usually measured by the correlation between two separate administra-
tions of the same test or parallel forms of the test.
83
Behaviors Technical Manual EFA
In more contemporary terms, one measures reliability in the form of correlations based on test-
retest or parallel forms reliability as defined, for example, by the APA in [73]. See also Section 7
earlier in this manual for other discussions on the topic of reliability. If we define reliability to be
the correlation between two different representations of the same variable, we then have zj and zJ
representing the two variables, and we define the correlation between them to be rjJ , following the
notation in [84].
Z = Af + U y, (10.41)
where we choose to follow the notation of [84] again. In this setting, we have defined Z to be the
complete set of N observations of each of the n variables. So, Z is an n × N matrix with entries
∗ = z /N . The choice to divide by N is for a matter of notational convenience during matrix
zij ij
manipulation. Similarly, we define F to be the m × N matrix of common factors, Y is the n × N
matrix of unique factors, and (A|U ) is the matrix A augmented with an n × n matrix containing
the values uj on the diagonal and zeros elsewhere.
If we denote by sji the correlation between the variable zj and the factor Fi , we may define
sji = rzj Fi and further define S = (sji ) to be the matrix of correlations between the standardized
variables and the common factors. The matrix A from (10.41) is called the factor pattern matrix
and the matrix S is called the factor structure matrix.
With the previous definitions at hand, we may derive a series of relationships between the factor
pattern (A) and the factor structure (S). Begin with the following basic pattern relationship
Z = AF. (10.42)
If we post-multiply (10.42) by F 0 (since F is not square, we must use the transpose) to obtain
ZF 0 = AF F 0 = S. (10.43)
Note that the far left hand side of (10.43) is merely the definition of correlation between Z and
F . It is also common to denote the matrix of common factor correlations by Φ = F F 0 and
hence, S = AΦ is another way to express this relationship. This implies, assuming that Φ is non-
singular, that A = SΦ−1 . Recall that, by definition, the matrix of observed correlations is given by
R = ZZ 0 . We may then represent the reproduced correlation matrix (i.e., the correlation matrix
with communalities on the main diagonal) as
84
Behaviors Technical Manual EFA
The objective of a factor problem is to account for the tests, or their intercorrelations,
in terms of a small number of derived variables, the smallest possible number that is
consistent with acceptable residual errors.
This quote is from Thurstone in his 1947 work “Multiple Factor Analysis”, see [169]. This approach
is the alternative offered to, and not to be confused with, maximizing the variance extracted as in
the principal factors approach. While we do not directly discuss this concept in this section, one
may consult Section 13.5 where we give a brief introduction to principal component analysis. One
may also consult any of a number of texts and articles on the topic, such as Chapter 8 in [84].
The basic idea behind principal factors or principal component analysis is a variable reduction
based on maximizing the proportion of variance accounted for by each factor, successively. By this
we mean that the first factor accounts for the maximum possible proportion of the variance that
can be extracted by a single factor. The second factor then accounts for the maximum possible
amount of the proportion of the variance that remains after the first factor has been extracted,
and so on. The minres approach, rather than being concerned with accounting for variance of
the original variables, attempts to reproduce the observed correlation matrix via the theoretically
based correlation matrix with as little error as possible. This is accomplished via minimizing the off-
diagonal sum of squares of the residuals between the observed and reproduced correlation matrices.
Recall that the reproduced correlation matrix is an approximation of the observed correlation
matrix with communalities on the diagonal.
This approach is described mathematically as follows. Recall that the factor problem may be written
as Z = Af +U y where A is the factor pattern or factor loading matrix. We may attempt to find the
best fit based on R̂ = AA0 where the matrix A is to be considered an unknown, R̂ is the reproduced
matrix, and R is the observed correlation matrix. We then employ the usual least squares approach
to either fitting R by R̂ + U 2 or fitting R − I by R̂ − H 2 where H 2 = I − U 2 = diag(AA0 ). In other
words, H 2 may be recognized as the diagonal matrix with communalities on the diagonal.
Utilizing R̂ + U 2 leads us to the principal component approach. Using R̂ + H 2 and constructing
an appropriate objective function leads to the minres solution. What is meant by an appropriate
objective function is that we first desire, as mentioned earlier, to minimize the off-diagonal sum of
squares of residuals between the observed and reproduced correlation matrices. Hence, we construct
the sum of the off-diagonal residuals first utlizing
n n−1 m
!2
X X X
f (A) = rjk − aji aki . (10.46)
k=j+1 j=1 i=1
It is noted in [84] that the function f (A) is dependent on the size of the correlation matrix. Hence,
rather than minimize a quantity that necessarily grows with the size of the correlation matrix,
we remove the dependency by scaling by the number of correlations to be computed. This new
objective function is called the root mean square deviation and is given by
s
2f (a)
g(A) = . (10.47)
n(n − 1)
85
Behaviors Technical Manual EFA
There is one further piece of the puzzle to discuss. We know from the earlier discussion, the
communalities are directly computed during this process, see (10.45) and the fact that we are
fitting R − I by R̂ − H 2 . By definition, communalities represent the amount of common variance
shared by the factors and a given variable. This implies that it is not enough to minimize (10.46)
or (10.47). Rather, we must delve into the realm of constrained optimization. Surely it must be
that a given communality h2j must be bounded above by unity and hence
m
X
h2j = a2ji ≤ 1. (10.48)
i=1
The authors do not intend to give a complete proof of the minimizer of (10.46) subject to the
constraints (10.48), but we do present some of the details given that they are both instructive
and constructive. The solution to this minimization problem is approached with the Gauss-Seidel
method. Gauss-Seidel is an iterative technique from numerical linear algebra that essentially de-
composes a matrix A into A = L+U where L is lower triangular an U is upper triangular with zeros
on the diagonal. The method then solves the problem Ax = b by the iteration Lxk+1 = b − U xk .
First, it is noted on pp. 177 of [84] that the basic factor theorem given in equation (2.50) on pp. 32
of [84] implies that displacements in a single row of A introduce linear functions of the displacements
in the reproduced correlation matrix. This further implies that the objective function is quadratic,
at most. This is convenient in that it dramatically reduces the computational time given the
computing power when this method was developed. Given modern computing power, this is really
not very important. However, the approach to solving the problems of the time is instructive to
the interested reader.
We first introduce a displacement in row j of A
Aj 7→ Aj + j = (aj1 + 1 , . . . , ajm + m ).
Next, we define a new variable bji = aji + i and compute
m
X
r̂jk = aki bji . (10.49)
i=1
n
m
2
P P
fj (A) = rjk − aki bji
k=1 i=1
k6=j
n
m
2 (10.50)
∗
P P
= rjk − aki i ,
k=1 i=1
k6=j
86
Behaviors Technical Manual EFA
Finally, we set the right hand side of (10.51) equal to zero and solve for the terms containing i .
After some rearranging, which is justified as all sums are finite, we obtain
Xn m
X X n
rjk akl = aki akl i . (10.52)
k=1 i=1 k=1
k6=j k6=j
Theorem 10.3. If the minimum of fj is attained at a point outside the constraint set defined by
(10.48), then a minimum of fj subject to (10.48) will be attained on the boundary so that (10.48)
may be replaced with
m
X
b2ji = 1 (10.55)
i=1
where we have replaced the a2ji factor loadings with b2ji as in (10.49) (since they are the minimization
variables).
Theorem 10.3 is essentially the idea behind any constrained optimization on a compact set. For the
reader unfamiliar with set theory, a simple way to understand compact sets is to keep in mind sets
of real numbers that are both closed and bounded. The constraint (10.48) defines a compact set.
The function (10.46) is clearly continuous. At the risk of over-simplification, this is essentially an
application of the extreme value theorem (EVT), although the actual proof uses Lagrange multiplier
theory rather than EVT. A discussion of the proof is provided in [84].
Given the discussion on the minres solution to the factor problem, it seems an appropriate time
to re-introduce some geometric concepts of factor problems. Using the notation from throughout
this section, we interpret the various pieces of Figure 10.2, in some sense, using the ideas of the
minres solution to the factor problem. The vector zj represents a response or an observation. This
observation is projected onto a seemingly arbitrary plane defined by the factors, orthonormal in
this case, F1 and F2 . The distance between the vector zj and its projection onto this plane, denoted
ẑj , is the error j . The coordinates of the vector ẑj in the basis defined by F1 and F2 are given by
the factor loadings and denoted aj1 and aj2 .
87
Behaviors Technical Manual EFA
The variables and factors represented in Figure 10.2 may be thought of as representing the various
entries of the matrices and objective function in the development of the solution to the minimum
residual problem. In this figure, the minres solution is seeking to find, simultaneously, a solution
zj
F1
j
aj1
ẑj Unit Circle
aj2
F2
which minimizes the objective function given in (10.46) subject to the constraints (10.48) by finding
that solution which produces the smallest residual errors j . This is not a completely accurate
comparison as the error j in Figure 10.2 is the distance from the vector zj and its projection onto
the F1 − F2 plane. The actual minres approach minimizes the sum of the squares of the off-diagonal
elements between the observed correlation matrix and the reproduced correlation matrix.
Z = P (F |Y ) = P F̃ (10.56)
where (F |Y ) is the common factor matrix augmented with the unique factor matrix. In theory, if
P is invertible, a unique solution for the factor scores exists and is given simply by F̃ = P −1 Z.
However, the entire point of factor analysis, as discussed in the first part of Section 10.5 is to
represent a test or its correlations with “few variables”. This implies that the matrix P will rarely
be square, the first requirement that P be invertible. This implies that the number of unknowns
exceed the number of equations, further implying the possibility of an infinite number of solutions.
This does not mean that factor scores cannot be computed. [80] provides the example of a criterion
that must be satisfied for an orthogonal solution
1
F F 0 = In , (10.57)
n
which is the usual notation for an orthogonal matrix, and the matrix F is of size n × N . Further,
indeterminacy is not equal across all solutions, and hence, we may attempt to define solutions which
are “better” than others based on criterion such as that just stated for orthogonal solutions. [80]
also notes that relationships between factors and external criteria become more determinate as the
factor scores become more determinate.
88
Behaviors Technical Manual EFA
While the main resource for the factor score discussion is the paper by Grice, see [80], the notation
is inconsistent, and there is a major typographic error in his equation (8). Well known author
William Revelle notes the typographic error and has a brief discussion of the various factor score
approaches that may be read at https://personality-project.org/r/html/factor.scores.
html. For now, we summarize Revelle’s discussion using his notation.
Essentially, the problem boils down to finding an appropriate solution to
F = WZ (10.58)
or, the score on factor j for person i may be written
N
X
Fji = wjk zik . (10.59)
k=1
The most simple form of a solution is to take W = A0 where A is the factor loading matrix from the
P in (10.56). This is what Revelle calls his components method. An historical solution was proposed
by Thurstone and is denoted W = R−1 A where R is the observed correlation matrix and A is the
factor loading matrix again. Harman proposes using W = A(A0 A)−1 . The so-called Bartlett method
uses W = U −2 A(A0 U −2 A)−1 where U is the diagonal matrix of uniquenesses. The Anderson method
is only applicable to orthogonal factor solutions and is given by W = U −2 A(A0 U −2 RU −2 A)−1/2 .
Finally we have the ten Berge method. First, define the matrix L = AΦ1/2 where Φ is the matrix of
factor correlations. Next, define the matrix C = R−1/2 L(L0 R−1 L)−1/2 . Now, set W = R−1/2 CΦ1/2 .
There is clearly a range of differences in the various solutions for the matrix W in (10.58). These
authors have no formulated opinion in favor of or against any of the methods. We present the
material here as part of the larger overall discussion related to factor analytic procedures. We
do want to stress an earlier remark based on information in [80]. Using factor scores should
produce better criterion relationships than using other forms of scoring methodologies. The more
determinate the factor scores are, the more likely one is to see more determinate relationships with
external criteria. The main reason for this is the simple fact that not all assessment items are
created equal, and the factor analytic approach allows for weighted scoring based on the relative
strength (factor loads) and the levels of correlation that exist between the data and the factors,
depending on the choice of method.
89
Behaviors Technical Manual Graded Response Model
X = T + E, (11.1)
where X is an individual’s observed score, i.e., something we can measure such as a test score,
T is the individual’s (unobservable) true score, and E is the error associated with the true score.
With these definitions, we may then define the classical version of test reliability by denoting the
variance in the true and observed scores as σT2 and σX
2 and defining
90
Behaviors Technical Manual Graded Response Model
σT2 σT2
ρXT = 2 = 2 , (11.2)
σX σT2 + σE
which may, in some sense, be interpreted as a signal to noise ratio.
One of the main issues with CTT and its application in more contemporary assessment/data
analysis is that CTT often over-simplifies a given situation in its approach. For example, many
classical test theory approaches rely on some form of dichotomization during the process of analysis.
Take for instance the classical definition of item difficulty. Item difficulty is often defined simply as
the percentage of incorrect responses to an item by the population under consideration. This is a
straightforward definition if the test item of interest is simply a yes/no or right/wrong or true/false
type question. However, more contemporary formulations rely on what may be termed graded
responses or responses on a Likert-type scale.
The idea of “dichotomizing” a variable is a simple one. However, it should be, at least, intuitively
clear that something must be lost during the process. As an example, take an assessment item
that is originally scored on a Likert-type scale with six possible graded choices. The first “bucket”
is denoted by 1 and represents the lowest level of agreement with the item, with each successive
bucket representing a slightly higher level of agreement until the final bucket is reached representing
the most agreement with the item.
From the CTT standpoint, we are generally forced to choose a dichotomization, usually by defining
any items ranked 1,2, or 3 to denote a negative response, and items ranked 4,5, or 6 to be posi-
tively endorsed. In this way, we have created a binary variable, 0 indicating generally unfavorable
response, and 1 indicating generally positive response. We provide a simple example to illustrate
the potential pitfalls and to motivate the remaining discussion.
As mentioned above, item difficulty may be defined simply as the percentage of negative responses
to a (dichotomous) item, the thought being the higher the percentage, the more difficult the item. It
is noted that this is not the only definition of item difficulty, nor is it the most appropriate. However,
it is the simplest and most useful for a simple explanation of the potential issues that accompany
dichotomization of a variable. Suppose we consider two separate response distributions, both with
100 respondents, on a six-point Likert-type assessment item. Suppose the response frequencies are
given in Table 11.1.
Subset 1 2 3 4 5 6
X 10 15 25 25 15 10
Y 35 10 5 5 10 35
A simple transformation takes the information from Table 11.1 and creates Table 11.2. Simply
summing the values in Table 11.2 for categories 1,2, and 3 provide on estimate of item difficulty for
the two distributions. A quick check shows that both the X and Y distributions show a difficulty
of 0.50.
91
Behaviors Technical Manual Graded Response Model
Subset 1 2 3 4 5 6
X 0.10 0.15 0.25 0.25 0.15 0.10
Y 0.35 0.10 0.05 0.05 0.10 0.35
However, it should be clear that the behavior of these two items is quite different in nature. The
first (X) is a distribution similar in nature to a normal distribution, while the second (Y) is a barbell
shaped distribution with the extremities dominating. This latter example is an example of an all
or nothing response pattern, either one loves the item or one hates it. The former exhibits, in some
sense, an ideal distribution with a majority of the distribution centered around the middle of the
response scale. Regardless, information appears to be lost through the process of dichotomizing a
variable.
Given the above described potential problem, it is the desire of item response theory (IRT) to
extend the ideas of CTT in a reasonable manner while providing a framework that both accounts
for classical test theory results where applicable, yet allows for the extension of these ideas to a
modern approach to the classical analysis. According to [133], IRT may be described as relating the
probability of a person responding to an item in a specific manner to the standing of that person
on the trait the item is measuring. This is a big jump on the theoretical side from attempting to
measure the error in true score, based on the original definition of an observed test score, to relating
a respondent’s probability of answering in a specific way to their “ability” level as measured by the
item.
Reading through [83] and [133], one gets a feel that there are slightly differing opinions of the
evolution of IRT over the years. For example, [133] traces the foundation of IRT to Fechner circa
1860 (see [7]) and Thurstone in 1925 (see [14], [168], and [175]). The mathematical roots of IRT
in psychology may be traced back to the work of Binet, Simon, and Terman from 1916 (see [7]
and [14]), while the formal basis of IRT is generally attributed to Lawley for his work in 1943 (see
[7], [104], and [175]). Lord (see [115]) formalized IRT as the extension of CTT (see [7]). Lord
([116]), Lord & Novick ([117]), and Birnbaum ([13]) helped establish the acceptance of IRT in the
psychological community. Finally, Rasch (see [139], [140], and [141]) provided the first specific class
of IRT models.
Compare the presentation of [83], which discusses more of the contemporary development of IRT
beginning with Lord (see [113], [114], and [115]), citing that Lord led the transition from CTT to
IRT from Gulliksen’s (see [81]) challenge to develop invariant item statistics for test development.
[83] also notes the development of the “item characteristic curve” (see [170]) and the use of the
term “latent trait” (see [105]). Finally, a suggestion to change the terms “item characteristic curve”
and “trait” to “item response function” and “ability” is provided by Lord (see [116]).
Note that it is not the intention of the authors of this chapter to suggest that there are any
inconsistencies in the telling of the storing of the development of IRT. Rather, it is to point out
the differing perspectives and to suggest both as important reading for the individual interested in
understanding the role and development of the theory. On one hand, [133] provides a somewhat
more motivation based approach, while on the other, [83] seems to present a more milestone based
perspective. Both are equally important in tracing out the history.
The majority of the presentation is still related to the mostly dichotomous approach to item analysis.
Even at this stage, we still define concepts such as difficulty in terms of a binary approach, we just
92
Behaviors Technical Manual Graded Response Model
now compute an item characteristic curve or item response function, depending on one’s preference
for the language used. However, it is this transition from CTT to IRT which marks the beginning
of the ability to take the leap from a dichotomously based analysis to polytomous response based
analysis.
93
Behaviors Technical Manual Graded Response Model
Note second that there is a point, call it x0 in R such that the distribution is concave up (the graph
“opens up”) on (−∞, x0 ) and concave down (the graph “opens down”) on (x0 , ∞). The point
(x0 , F (x0 )) is known as an inflection point for the function F (x). At this point, some interesting
things happen for a generic CDF.
F (x0 ) is the half way point between the minimum and maximum values of F (x) and, assuming the
function F (x) has a derivative F 0 (x), F 0 (x) attains its global maximum at (x0 ). The maximum is,
in fact, global given that F (x) is strictly monotonically increasing on R.
With the interpretation of IRT as relating probability to ability, we an infer for a generic CDF
that the range of F (x) is (0, 1), which implies that F (x0 ) = 0.50. In other settings, the CDF may
have various parameters that may come into play determining a probability other than 0.50 for this
relationship. For example, for the Rasch 3-PL model which is introduced later in this section,
1+c
F (x0 ) = ,
2
where c is a parameter in the 3-PL model representing a lower bound. This lower bound represents
the possibility of guessing, say in a multiple choice assessment format. Note that in our format, we
have implicitly assumed no guessing implying a value of c = 0.
This special point, the point x0 defining the inflection point of the CDF F (x), is known as the
item location. One main advantage of this formulation is that the item location and the individual
ability are now measured on the same scale or metric. This means that we may use the previous
interpretation of IRT, that of relating probability to ability, to now state that if an individual’s
ability is greater than the item location, the individual is more likely than not to respond “correctly”
or in a “positive” manner to the item. In other words, we have extended the definition of item
location or item difficulty from the CTT viewpoint to a modern version in the IRT framework.
Figure 11.2 is a continuation of Figure 11.1 with the vertical green line representing the point on the
x-axis x0 and the horizontal red line representing the value on the y-axis F (x0 ). The reality is that
we will not know x0 , but will know the value for F (x0 ). In this case, we need to find the value of x
corresponding to the equation F (x) = 0.50. Since F (x) is monotonically increasing, F is invertible,
and we may find x = F −1 (0.50) to be the value of x0 . More generically, we find the solution to the
94
Behaviors Technical Manual Graded Response Model
equation F (x) = y, where y is the midpoint between the lower and upper asymptotes of the CDF.
In most, if not all, cases this solution is likely to exist by invoking strict monotonicity of the CDF.
Another notion of interest in item analysis and IRT is that of item discrimination. Item discrim-
ination is a statistic that measures how well an item is able to differentiate individuals along the
trait continuum. In other words, given a small change in the input (i.e., the trait), how significant
or large is the change in the probability of giving a correct response or positive endorsement? The
larger the change in output given the same change in input, the more discriminating an item is.
The preceding description of item discrimination foreshadows one possible approach to computing
it. A natural computation to measure the changes previously discussed is to take the ratio of the
change in output to the change in input, or the slope of a line. In particular, we are interested
in the slopes of the lines tangent to the CDF. Moreover, we are interested in where this infinitely
large set of lines attains maximum slope.
As noted above, the derivative of the CDF attains a global maximum at the point x0 . Given the
usual interpretation of the slope of the line tangent to a curve being equal to the value of the
derivative to the function generating the curve at the point of tangency, we have our suspect for
the value of item discrimination. The value of item discrimination is, in fact, defined to be the
slope of the line tangent to the CDF curve at the point where the CDF attains a global maximum,
x0 . See Figure 11.3 for a visualization of the preceding discussion.
Similar to finding the value for which F (x) = 0.50, at least in this example, we now find our
item discrimination value to be the value of the slope of the line tangent to the CDF at x0 ,
or m = F 0 (F −1 (x0 )). For completeness of exposition and to provide the details for the reader
unfamiliar with the some of the mathematical constructs discussed above, we finish this portion of
the discussion with a presentation of the following theorems without proof.
Theorem 11.2. Let I be an interval and let f : I 7→ R. If f is monotone and f (I) (the image of
I under f ) is an interval, then f is continuous.
So, assuming that a function is monotonic and maps an interval to an interval, then the function
is continuous.
Theorem 11.3. Every continuous 1-1 real-valued function on an interval is strictly monotone.
Theorem 11.4. Every continuous 1-1 real-valued function on an interval has a continuous inverse.
It is this third theorem that we need to invoke for some of the previous statements. It is the
combination of the first and second theorems that provide for one possible approach to the proof
of the third.
Now that we have provided some of the motivation behind the development, we provide some
concrete examples in the dichotomous space prior to extending these concepts to the polytomous
space. We begin by noting that one of the first researchers to provide a specific class of IRT models
for the dichotomous case is Rasch in his 1960 work, see [139]. There are many equivalent ways to
write down this model, so we choose the following notation:
exp (θn − δi )
P r{Xni = 1} = , (11.3)
1 + exp (θn − δi )
where θn represents the ability level of the nth respondent and δi represents the location or difficulty
of the ith item of an assessment. The functional relationship represents the probability that the
nth individual chooses a “correct” answer on the ith item and is denoted by P r{Xni = 1}, but
may be more explicitly denoted P r{Xni = 1|θn }. The notation P r{Xni = 1|θn } makes explicit the
conditional probability’s dependence on respondent ability.
Figure 11.4 presents the Rasch 1-PL (one parameter logistic) model for three separate values of the
location or difficulty parameter. The alert reader will have noted that the blue curve in the figure is
identical to the blue curve presented in the earlier examples. The red curve is “easier” in the sense
that it requires a lower level of the assumed underlying latent trait to obtain the probability of 0.50
of obtaining the “correct” answer to the given item. The green curve is correspondingly harder for
96
Behaviors Technical Manual Graded Response Model
similar reasons. Note the implicit assumption in the Rasch 1-PL model the item discrimination is
fixed and equal to one.
Changing notation slightly, we introduce the concept of a discrimination parameter and present
the Rasch 2-PL model in the following form.
exp (a (x − δ))
f (x) = , (11.4)
1 + exp (a (x − δ))
where a now represents the discrimination parameter.
Figure 11.5 displays the Rasch 2-PL model for various values of the discrimination parameter a.
The a = 1 value corresponds to the 1-PL model. The green curve represents a more discriminating
response function, while the red curve shows a not very discriminating response function.
A final example is the Rasch 3-PL model which incorporates a “guessing” parameter that provides
for a lower asymptote other than zero. The usual example for this is when an individual is taking
a multiple choice assessment in which each item has four possible answers. In this case, one would
expect that the lower limiting behavior to be 0.25. In this case, we modify the 2-PL model from
(11.4) as follows:
(1 − c) exp (a (x − δ))
f (x) = c + . (11.5)
1 + exp (a (x − δ))
It is instructive to spend a moment exploring the formula in (11.5). Note that if we set c = 0 we
recover exactly the formula in (11.4) for the 2-PL model. Similarly, if we set c = 0 and a = 1 in
(11.5), we recover the formula in (11.3) for the 1-PL model. Figure 11.6 shows an example of the
3-PL model with c = 0.25.
There is a interesting and important pattern emerging here. We began with yes/no type questions
and defined certain statistics such as item difficulty and discrimination. These were only defined
on dichotomously scored items and provided limited information. We generalize in such a manner
that the interpretation of the model is slightly different, yet we still arrive at those very important
statistics, in some sense. We then generalize a bit more and still arrive at models, that when
simplified, are completely compatible with the previous versions.
97
Behaviors Technical Manual Graded Response Model
Note that there are other possible choices for the distribution function under consideration. The
Rasch series of models typically uses the logistic function, hence the descriptive names n-PL model
which is short for the n-parameter logistic model. In practice, there are only the 1, 2, and 3-PL
models. However, at least theoretically although not much used in practice, is the 4-PL model.
The 4-PL model incorporates an upper asymptote that is less than one. However, this model has
not gained much traction, at least not that the authors of this chapter are aware.
The Rasch models increase in complexity by considering varying numbers of parameters with various
roles. At each stage the earlier models are recoverable by judicious choice of parameter values, which
is a very appealing feature. However, the underlying assumption is that we must, in order to apply
the Rasch family of models to our problems, still dichotomize our data. We require a solution that
allows us to not lose the valuable information that is being presented to us in the ordered category
response assessment data we collect. This is where Samejima and her myriad contributions to the
field begin to emerge.
... the Rasch models have been built on the requirement of model parameter sepa-
rability, with the associated issues of sufficiency, specific objectivity, and fundamental
measurement. Any hypothesized response process that may have generated the data to
be modeled is essentially an afterthought.
In this setting, the term “specific objectivity” refers to the assumption that the underlying trait
being measured should be independent of any particular instrument through which it is measured.
It is interesting, in light of contemporary views on the importance of response processes, that
Rasch developed his theory with “any hypothesized response process” being an afterthought. In
comparison, Samejima’s approach is quite different and is summarized in the following statement,
also from [133], pp. 63:
98
Behaviors Technical Manual Graded Response Model
So, to provide a short summary, the response processes are an afterthought in the formulation of
the Rasch models, while they are paramount in Samejima’s framework. This is not to say that
the Rasch models have no use. In fact, as the reader shall see, Samejima essentially uses the
Rasch model framework to model the dichotomous boundary characteristic functions used to build
the polytomous item response theory models of her theory. It is interesting, however, to note
the fundamental difference in underlying assumptions that led to the development of quite similar
(mathematical) models in the end.
In much of the literature regarding the polytomous item response theory models, little mention is
made to specific models or specific response formats. The reader familiar with TTI Success Insights
assessments that utilize an ordered response model or Likert-type model should keep this response
format in mind while reading the remainder of this section. For the reader unfamiliar, TTI Success
Insights typically uses a format intended to measure level of agreement with a given statement. For
example, in measuring one’s agreement with how important the statement “Bold, daring” is related
to one’s behavior, we might ask a respondent to rate on a scale of 1 to 6 with 1 being anchored
with the statement “Not at all like me” and 6 the statement “Very much like me”.
During the forthcoming discussion, the authors of this chapter suggest that the reader keep the
preceding example in mind. As a foreshadowing, from this example, we would be considering six
categories with five boundary or decision points for the respondent.
There are two points of interest to make prior to presenting the IRT models of interest. First is
that we desire and have the ability to collect data in more than a yes/no or right/wrong format, for
example on a Likert-type response scale. Moreover, we wish to analyze this data in a way that does
not lose information. As demonstrated in an earlier example, much more information is contained
in a six category format versus a dichotomous response.
Further, Samejima (see [148], [149]) demonstrates the increase in information available when gath-
ering the information with more categories (polytomous), while Cohen (see [45]) shows that there
exists a systematic loss of information when dichotomizing continuous or categorical data with
many categories. With these concepts in mind, we present a condensed version of Samejima’s
graded response model theory.
Generally speaking, each response category of an item exerts a level of attraction on each individual
respondent. The level of attraction varies across the population of respondents based on their
individual level of the underlying trait being measured. An individual’s level of attraction in a
given category is contingent upon the respondent having been attracted to the previous category.
In other words, one is not immediately attracted to the level six category in our example without
first having considered and found each previous category attractive enough to at least consider the
next higher level.
Samejima defines what she called a processing function, denoted Mij (θ) which is defined to be
the probability of being attracted to category j over and above category j − 1. It is helpful to
make note that historically, Samejima used the notation Mig (θ). Given that the primary author
99
Behaviors Technical Manual Graded Response Model
The cumulative attraction is operationalized as Pi∗j (θ), where Pi∗j is defined as the prob-
ability of responding positively at a category boundary given all previous categories,
conditional on θ. ... the probability of responding in a category can then simply be
calculated as the probability of responding positively at a category boundary minus the
probability of responding positively at the next category boundary...
.
In terms of the Pi∗j , one may write this as
Prior to presenting the mathematical particulars for this model, it is instructive to summarize some
of the history as presented in [133]. GRM is the archetypical model in the Samejima framework
due to the following:
2. GRM is the polytomous manifestation of Thurstone’s method of successive intervals, see [35],
[62], and [144];
3. GRM relies on (Rasch) 2-PL model, see e.g., [54], [68], [69], and [151], to obtain cumulative
boundary functions Pi∗j ;
4. The functions Pi∗j could be built based on any appropriate mathematical function, see [147],
[150];
5. In practice, only 2PL and 2-parameter normal ogive have been used; both outlined in Same-
jima’s original work, see [146].
Following the lead of the 2-PL model given in (11.4) we define the category boundary response
function (CBRF) of the graded response model (GRM) to be
∗ exp ai x − βij
Pij (x) = , (11.7)
1 + exp ai x − βij
where the ai represent the discrimination parameter of the ith item and βij denotes the location of
the j th boundary for the ith item.
With the CBRF at hand, we can now go after the true desired quantity, the item characteristic
response function (ICRF). Recall that the ICRF measures the true probability of responding in a
given category as compared to the probability of crossing the boundary as in the CBRF. In order
to do so and utilize the definition provided in (11.6) we must first define two quantities of interest.
In order to compute Pi1 , we must compute Pi∗0 − Pi∗1 , yet we have no definition for Pi∗0 . One may
think of Pi∗0 as the probability that a individual responds to the item in the lower category or
higher. Given that a respondent is, by definition, responding to the item in the lowest category or
higher, we may simply define Pi∗0 = 1 and define its location to be −∞.
In a similar manner, in order to define Pin , where n is the total number of boundaries, we need to
compute Pi∗n − Pi∗n+1 and we have not defined Pi∗n+1 . However, the interpretation of Pi∗n+1 is for a
respondent to answer the item in a category higher than the highest available category. Hence, we
define Pi∗n+1 = 0 as the interpretation is only possible if this is the case. Also note that the location
for Pi∗n+1 is ∞.
Thus, we have the following ICRF for the GRM model in general, and specifically for the case of
Pi∗j as provided for in (11.7). Pi1 = Pi∗0 − Pi∗1 = 1 − Pi∗1 , Pin = Pi∗n − Pi∗n+1 = Pi∗n , and the remainder
are as defined in (11.6). To summarize,
∗
1 − P i1 if j = 0,
P ∗ if j = n,
P ij = (11.8)
i∗n
Pij − Pi∗j+1 otherwise.
There are a handful of convenient properties to mention at this point. The fact that the model
has only a single discrimination parameter per item implies that the shape of all the CBRF is the
101
Behaviors Technical Manual Graded Response Model
same and it is the location for each CBRF that varies. This is convenient in that the location for
each CBRF for the GRM model is solved in the same manner as previously. Since each CBRF is
a dichotomous model, and we piece together the cumulative probabilities to generate the ICRF, it
is tempting to look at the ICRF for the location. In fact, one should look no further than solving
the equation Pi∗j = 0.50 as described in an earlier section.
In contrast, the polytomous Rasch models have boundary locations where adjacent ICRF cross.
One nice result of this GRM property is that there are never boundary reversals, which may occur
in other models. Another nice property is that GRM models may have different discrimination
parameters for different items on the assessment in addition to multiple measures of location (dif-
ficulty).
As a final note, it is possible to consider a free response or continuous response model. This can be
accomplished by a limiting process in that we may consider the limit as the number of categories
approaches infinity. In this way, perhaps a respondent answers on a sliding scale with endpoints 0
and 10 (or 0 and 1, or 0 and 100). Rather than choosing an integer between 1 and 10, one simply
slides the ruler to an appropriate place on the continuum that is consistent with one’s viewpoint
on a given item. So, rather than answering 2 out of 10 on a categorical scale, one may slide to
2.10304657 out of 10. While Samejima explored this approach, it has not caught on as a popular
one in practice. This topic is beyond the scope of the current discussion, but may be the topic of
a future paper by the current authors.
∂Pi (θ)
Ii (θ) = . (11.10)
∂θ
102
Behaviors Technical Manual Graded Response Model
Each item on the assessment has an item information function. The test information function is
then defined in the natural way. For each latent trait we have
n
X
I(θ) = Ii (θ). (11.11)
i=1
In some applications, one may be interested in the standard error of estimation which is defined as
the square root of the reciprocal of the test information function
1
SE(θ) = p . (11.12)
I(θ)
Similar results hold for all the models presented earlier in this work. Bowing to brevity, we do not
present all of these results. The interested reader may consult [8] for a more complete treatment.
Boundary Characteristic Curves
For boundary levels between Likert Responses A, B, C, D, E, & F
0.9
0.7
1* 2* 3* 4* 5*
Prob.
0.5
0.3
0.1
−4 −3 −2 −1 0 1 2 3 4
Ability
Of particular interest for this work is the information coming from the graded response model of
the previous section. In this case we want to consider the boundary characteristic curves. In the
case of the TTI Success Insights assessments, we typically have a Likert-style scale consisting of six
possible categories as described in Section 11.3. Also noted in that section is that there exists five
boundaries for the six categories. Each boundary curve represents the probability of a response at
or above the lower category. As an example we plot a theoretical set of boundary characteristic
curves for one item for this type assessment.
Continuing to follow the development of the GRM model from Section 11.3, we define the item cat-
egory response characteristic curves as previously outlined. The following is a theoretical example
based on the boundary curves presented in Figure 11.7.
It is noted that the information presented in Figures 11.7 and 11.8 are theoretically “ideal” in some
sense. The curves are all equally spaced in both figures and the maximum values of probability of
selecting a category are all equal for the four interior categories in Figure 11.8. It is unlikely to
expect such high quality results in practice. We present them as examples to be used for comparison
purposes.
We now turn our attention to constructing the item and test information functions for the polyto-
mous response case, specifically for the GRM. Following [8], we note that the ultimate goal is to
determine the precision with which a test or assessment measures ability and this has traditionally
103
Behaviors Technical Manual Graded Response Model
A F
0.9
0.7
0.5
Prob.
B C D E
0.3
0.1
−4 −3 −2 −1 0 1 2 3 4
Ability
been measured via the test information function approach. It is also the case that the test infor-
mation is a function of the individual item information functions, which are in turn functions of
the category information functions. This may be seen as follows.
Note that we may consider the dichotomous case to be a simplification of the graded response
model with only two cases. This is clearly true given that we model each boundary with the same
functional approach as used in the dichotomous case, so the restriction is clear.
Define these two cases to be Pi1 = Pi and Pi2 = Qi = 1 − Pi and denote by Iij the amount of infor-
mation associated with the j th response category. We may then write the amount of information
that is contributed by a given response category as the product of the probability of responding in
that category and the score (information) one obtains given a response in that category. We may
then sum this over all categories (in this case there are two) to obtain the item information,
From this point, the extension is straightforward, and we obtain for the case of n categories
n
X
Ii (θ) = Iij (θ)Pij (θ). (11.13)
j=1
Relying on Samejima’s development, see [146] and [147], define the information function of an item
response category to be
Example IIF
1.2
0.8
Information
0.6
0.4
0.2
0
-4 -3 -2 -1 0 1 2 3 4
Ability
2
Pij0 (θ)
Iij (θ)Pij (θ) = 2 − Pij00 (θ). (11.16)
(Pij )
Finally, the item information for item i becomes
2
n P 0 (θ)
X ij
00
Ii (θ) = 2 − Pij (θ) . (11.17)
j=1
(P ij (θ))
As a final note on item information, it is possible to formulate the entire development of the item
information functions in terms of the boundary probabilities rather than the category probabilities.
If we use the definitions of the Pij in terms of the Pij∗ we are able to derive
2
n ∗0 (θ) − P ∗0 (θ)
Pij−1
X ij
Ii (θ) = ∗
Pij−1 (θ) − Pij∗ (θ)
j=1
Xn
00
(θ) − Pij00 (θ) .
− Pij−1
j=1
Of note, Samejima showed that the second summand in the previous equation vanishes identically,
see [146]. In that case, the computation in terms of boundary probabilities reduces to
2
n ∗0 (θ) − P ∗0 (θ)
Pij−1
X ij
Ii (θ) = ∗ . (11.18)
Pij−1 (θ) − Pij∗ (θ)
j=1
(Pi0 (θ))2
Ii (θ) = (11.19)
Pi (θ)Qi (θ)
105
Behaviors Technical Manual Graded Response Model
where we have suppressed the dependence on θ of the various functions. The final pieces are related
to parameter estimation followed by ability estimation for the individual respondent.
Example TIF
6
4
Information
0
-4 -3 -2 -1 0 1 2 3 4
Ability
Note that the authors of this manual have found little in the form of concrete statements as to what
represents an acceptable level of information. In most literature reviewed for the writing of this
manuscript the item information function is described as a measure of internal consistency reliability
across the spectrum of ability (difficult). It may then seem reasonable to apply the typical lower
boundaries for reliability coefficients as first presented in Table 7.2 in Section 7.2.1. Recall that
reliability coefficients with values in the range 0.70 ≤ α < 0.90 are considered to be Respectable
to Very Good. Values below 0.70 are Undesirable to Unacceptable and values above 0.90 may
be considered to represent a highly homogeneous scale and one may want to consider shortening
it. Note that we have used α to represent a generic reliability coefficient and not necessarily the
traditional α coefficient mentioned in Section 7.2.1.
An example is given in Figure 11.11 showing an item information function along with a horizontal
red line showing where the 0.70 value is achieved by the function. In this case, we see an item
information function that appears to be strongly reliable throughout a large portion of the ability
range, dropping off at the extremes of visible ability spectrum.
It seems a natural extension of the item information function to the test information function would
be to determine the range of the ability spectrum for which the test information function exceeds
0.70n where n is the number of items that make up the assessment. In the example case at hand,
we assumed there were 5 questions. Hence, we are interested in the portion of the ability range for
which the test function exceeds 0.70 ∗ 5 = 3.50. An example of this is shown in Figure 11.12.
106
Behaviors Technical Manual Graded Response Model
Example IIF
1.2
IIF
IIF = 0.70
0.8
Information
0.6
0.4
0.2
0
-4 -3 -2 -1 0 1 2 3 4
Ability
Example TIF
6
TIF
TIF = 3.5
4
Information
0
-4 -3 -2 -1 0 1 2 3 4
Ability
107
Behaviors Technical Manual Graded Response Model
are known, a scoring rubric may then be derived (using the same optimization technique, now with
ability rather than the model parameters as the optimization variable(s)).
This optimization with respect to the ability variable may be done on an individual basis, thereby
deriving a weighted scoring for the individual. It is not a difficult extension, then, to implement and
utilize this scoring metric, assuming stable model parameters over large samples of the intended
population. This is similar in nature to a factor analytic approach to scoring assessments in which
the factor loadings across the items and their factorization may be utilized. In this way, one is
lending more weight to the items deemed, by the underlying IRT or FA model, as more difficult
and presumably such an item that should carry more weight in the scoring algorithm.
108
Behaviors Technical Manual Thurstonian IRT Model
109
Behaviors Technical Manual Thurstonian IRT Model
ment format data. One may consult Section 11 of this manual for a discussion of the TTI Success
Insights approach to GRM as applied to likert-style versions of our DISC model for analytic pur-
poses. In that section, we use the difficulty and discrimination parameters of the GRM output to
establish grouping of items from the different scales that show similar patterns. This is an attempt
to minimize the bias that may be introduced in the assessment development process noted in an
earlier part of this section.
Another of the arguments against the forced-choice format is that the assessment taker is forced to
respond to the assessment item, whether they have an opinion or not. The likert-style format can
allow for a neutral option. If one is measuring opinion, say of service received at a fine restaurant,
then a neutral option may be justified. However, if one is attempting to measure an ability spectrum
using a psychometric assessment, perhaps the researcher requires the respondent to choose.
Additionally, the likert-style format allows the individual to be all things at all times, if they choose.
This is particularly true in the case where an assessment may be used in high stakes situations,
such as selection for hiring or promotion. If the assessment taker is aware of the stakes, they may
attempt to be exactly that, all things at all times. A forced-choice format does not allow for this
type response pattern. Perhaps an example is in order.
Suppose we are interested in measuring what is really important to an individual that helps guide
them in their lives. In other words, what is the main motivation behind the individual. Suppose
we ask the open-ended question
• recognition.
• money.
If we were to ask this in a forced-choice format, the individual is then asked to choose those
outcomes that are most important to them, then next most, and so on. In a likert-style format,
the individual may be able to rank several of these responses as all or nothing. Is it possible for
an individual to be equally motivated by intellectual pursuit and power or money? We provide an
anecdotal response to that, which is clearly a rare occurrence, but an occurrence nonetheless.
In August 2006, Grigori Perelman was awarded the Fields Medal for his contributions to the proof
in the affirmative of the Poincaré conjecture, a century old problem first proposed by Henri Poincaré
in 1904. Pereleman famously refused the medal as noted in a NY Times article from August 22,
2019. According to the NY Times article, Perelman met with Sir John M. Bald, president of the
International Mathematical Union where he rejected again the award. According to New Yorker
magazine in an article dated March 19, 2011, Perelman is quoted as saying
110
Behaviors Technical Manual Thurstonian IRT Model
He proposed to me three alternatives: accept and come; accept and don’t come, we will
send you the medal later; I don’t accept the prize. From the very beginning, I told him
I have chosen the third one ... [the prize] was completely irrelevant for me. Everybody
understood that if the proof is correct, then no other recognition is needed.
In 2010, Perelman was further awarded the Millennium Prize, a one million dollar award given for
the solution to one of the eight, at the time, major problems that existed when the Millennium
Prize Problems were first established by the Clay Institute in 2000. The original problems are the
Birch and Swinnerton-Dyer conjecture, the Hodge conjecture, existence and smoothness of Navier-
Stokes equations, P versus NP problem, the Poincaré conjecture, the Riemann hypothesis, and the
Yang-Mills existence and mass gap. To date, only Perelman’s proof of the Poincaré conjecture has
been awarded a Millennium Prize. According to [145], Perelman refused the award due to the lack
of fairness surrounding the award, noting that his own contribution was no greater than that of
Richard Hamilton. Further,
To put it short, the main reason is my disagreement with the organized mathematical
community. I don’t like their decisions, I consider them unjust.
This is an individual who, according to some accounts lived in his parents’ basement, who turned
down the highest award and largest monetary prize in the field of mathematics because “the proof
is enough” and for fairness. This is also an individual that is likely as high on the intellectual scale
as one can be. This example is intended to illustrate that the highest level of a scale is very difficult
to achieve and likely does not play well with other scales on the same assessment. This is also
meant to be an argument, albeit anecdotal in nature, that in some cases we need to force choices
on individuals to obtain true measures of scales.
The entire point established just now is completely moot if we lack the ability to measure evidence
of validity and reliability adequately for the forced-choice format assessment. That is the main
purpose of this section of this manual. For discussion, we present again some information from
Section 3.5.
One of the main reasons that CTT is not applicable to forced-choice format data is the lack of
existence of an inverse for the scale correlation or covariance matrix. In other words, many of the
techniques of CTT, such as (exploratory or confirmatory) factor analysis require the inversion of
a certain matrix during the solution for the parameters of the underlying nonlinear optimization
problem. Essentially, most regression analyses rely on the inversion of a matrix on some level. In
the case of many CTT techniques, this matrix is covariance matrix of the underlying assessment
response data. In the case of forced-choice assessment data, the covariance matrix is, by definition,
singular. This means that the matrix of covariation for a forced-choice assessment cannot be
inverted.
As presented in Section 3.5, the authors in [94] note the following related to the correlation between
scales on a forced-choice format assessment.
Only by deleting a variable or variables can strict dependencies be removed but it will
be appreciated that the variables left still have shared specific variance, etc., and so
problems of interpretation remain.
The authors in [94] are generally quite negative towards forced-rank data, but mostly due to the
the use of the data generated to justify the reliability and validity of those assessments. In the
abstract to [94] the authors at least state the following:
111
Behaviors Technical Manual Thurstonian IRT Model
This is not to say that ipsative tests have no utility but that the claims made for
their validity and reliability and their applicability to inter-individual comparisons are
misleading.
We note that this article was first published in the Journal of Occupational Psychology in 1988.
There have been some interesting developments in the years that lead to the writing of this manual.
112
Behaviors Technical Manual Thurstonian IRT Model
Letting tij denote the j th subjects continuous preference for object i, Thurstone proposed the
following transformation, yl∗ , of the unobserved preferences
/ {i, i0 }
0 if k ∈
alk = 1 if k = i . (12.6)
−1 if k = i0
where the element alk is in the lth row and k th column of A. We may now write the left hand side
of (12.5) using the following notation
∗ t
y = (A I) = At + e, (12.7)
e
where I is the appropriately sized identity matrix. As an example, consider the case of n = 4,
corresponding to a single frame on the TTI Success Insights Style Insights assessment, a forced-
choice frame with 4 items. In this case, the matrix A takes on the form
1 −1
0 0
1
0 −1 0
1 0 0 −1
A= . (12.8)
0 1 −1 0
0 1 0 −1
0 0 1 −1
According to [121], combining the definition of A and the definitions of ylj from (12.4), we may
write for any paired comparison pattern
113
Behaviors Technical Manual Thurstonian IRT Model
ñ
" # Z Z
\
Pr yl = ... φñ (y ∗ : Aµt , AΣt A0 + Ω)dy ∗ (12.9)
R
l=1
where φñ (·) denotes the ñ dimensional normal density function with limits of integration
(0, ∞) if yl = 1
Rl = . (12.10)
(−∞, 0) if yl = 0
Ignoring the error term in (12.7), we simply write
y ∗ = At (12.11)
and we move into the realm of the Thurstonian factor model by considering the latent utilities to
be a (linear) function of the underlying traits of interest (in our case, the D, I, S, and C traits). In
other words, following [30] write
t = µt + Λη + . (12.12)
This should come as no surprise as we noted in (12.5) that the latent utilities had a mean vector µt .
We are merely writing the latent utilities as a linear function as a “deviation” from their means in
some sense. If there are n means and d latent traits, then Λ is an n × d matrix of factor loadings,
η is a d× vector of common factors (latent traits), and is an n × 1 vector of unique factors.
Note that we make some assumptions that are common in a standard factor analytic approach.
The latent traits (η) are uncorrelated with the unique factors () and have means of zero. The
latent traits are freely correlated (their covariance matrix is denoted Φ), but variances are fixed to
one to be able to identify. The unique factors are uncorrelated implying their covariance matrix
(Ψ2 ) is diagonal. Finally, we assume that both the latent traits and unique factors are normally
distributed.
At this point, the authors in [30] suggest reparameterizing the above model using equations (12.11)
and (12.12). They give several reasons for this which we summarize in the following list.
1. The Thurstonian factor model presented measures second order factors, such as latent traits,
versus first order factors, such as latent utilities
2. The Thurstonian IRT model speeds up computations in the case of large models
3. The residual error variances of the latent responses (y ∗ ) are zero for the factor model, implying
latent trait estimates cannot be computed
4. Reparameterization of the factor model as a first order model produces nonzero residual error
variances, allowing the computation of latent trait estimates.
Ψ̆2 = AΨ2 A0 .
Note that if we set p to be the number of frames or blocks of the n items we have the following.
2. Ψ̆2 is the (pñ) × (pñ) covariance matrix of the unique pairwise errors
To put some perspective on the scope of the problem for the TTI Success Insights Style Insights
assessment, we have p = 24 blocks measuring d = 4 traits with n = 4 items per block. The matrix
shown in (12.14) covariance relationships of the unique pairwise errors for a single four-item, forced-
choice frame or block.
2
ψ1 + ψ22
ψ12 ψ12 + ψ32
2 2 2 2
2
ψ1 ψ1 ψ1 + ψ4
Ψ̆ =
2 2 2 2
(12.14)
−ψ2 ψ3 0 ψ2 + ψ3
−ψ 2 0 ψ2 ψ2 ψ2 + ψ2
2 4 2 2 4
0 ψ32 ψ42 −ψ32 ψ42 ψ32 + ψ42
In order to illustrate some of the structure, we consider a condensed example of an instrument
similar to the Style Insights assessment containing p = 3 blocks. In this case, we first illustrate, for
convenience, the case with 3 latent traits measured by 3 items per block. In this case, the matrix
A is a block diagonal matrix of size 9 × 9 where each block diagonal entry is given by
1 −1 0
1 0 −1 .
0 1 −1
We present the full matrix for this example to present the structure.
1 −1 0 0 0 0 0 0 0
1 0 −1 0 0 0 0 0 0
0 1 −1 0 0 0 0 0 0
0 0 0 1 −1 0 0 0 0
A=
0 0 0 1 0 −1 0 0 0
(12.15)
0 0 0 0 1 −1 0 0 0
0 0 0 0 0 0 1 −1 0
0 0 0 0 0 0 1 0 −1
0 0 0 0 0 0 0 1 −1
Continuing, the Λ matrix is a matrix of 3 × 3 matrices “stacked” vertically with each 3 × 3 “block”
of the form
λi 0 0
0 λi+1 0
0 0 λi+2
where in this example i ∈ {3j + 1|j = 0, 1, 2} and in general i ∈ {3j + 1|j = 0, 1, 2, . . . }. The final
Λ matrix for this example looks like
115
Behaviors Technical Manual Thurstonian IRT Model
λ1 0 0
0 λ2 0
0 0 λ3
λ4 0 0
Λ=
0 λ5 0
(12.16)
0 0 λ6
λ7 0 0
0 λ8 0
0 0 λ9
λ1 −λ2 0
λ1 0 λ3
0 λ2 −λ3
λ4 −λ5 0
Λ̆ =
λ4 0 λ6
(12.17)
0 λ5 −λ6
λ7 −λ8 0
λ7 0 λ9
0 λ8 −λ9
Note that Ψ2 is the diagonal matrix with ψi2 as the ith diagonal entry. We then compute Ψ̆2 =
AΨ2 A0 to obtain
ψ12 + ψ22
ψ12 ψ12 + ψ12
−ψ22 ψ32 ψ22 + ψ32
ψ42 + ψ52
0 0 0
Ψ̆2 =
ψ42 ψ42 + ψ62
0 0 0
(12.18)
0 0 0 −ψ52 ψ62 ψ52 + ψ62
0 0 0 0 0 0 ψ72 + ψ82
0 0 0 0 0 0 ψ72 ψ72 + ψ92
0 0 0 0 0 0 −ψ82 ψ92 ψ82 + ψ92
where we note two things. First, Ψ̆2 is a symmetric matrix and we have suppressed the upper half.
Second, we have displayed Ψ̆2 in a smaller font for sake of being able to display the entire matrix.
For completeness, we present, in as much as possible, the same computations for the case mentioned
earlier with a four-item, forced choice instrument with 3 blocks. In this case, A is an 18 × 12 block
diagonal matrix with the following block structure
1 −1
0 0
1
0 −1 0
1
0 0 −1 .
0
1 −1 0
0 1 0 −1
0 0 1 −1
It then follows that the matrix Λ̆ is an 18 × 4 matrix with “stacked” blocks of the form
116
Behaviors Technical Manual Thurstonian IRT Model
λi −λi+1
0 0
λi 0 −λi+2 0
λi 0 0 −λi+3
0 λi+1 −λi+2 0
0 λi+1 0 −λi+3
0 0 λi+2 −λi+3
where the subscripts in the λ symbols follow a rule similar to that discussed in the earlier example.
To put the scope of all this into a more complete picture, we have the following dimensions for the
various elements of the model for the TTI Success Insights Style Insights assessment.
1. There are p = 24 blocks of n = 4 forced-choice items each measuring d = 4 traits;
4(4−1)
2. This implies ñ = 2 = 6 preference comparisons per block;
3. This, in turn, implies that γ = −Aµt is a vector of (24 · 6) × 1, or 144 × 1 thresholds;
4. A is a 144 × 96 matrix, Λ is a 96 × 4, giving Λ̆ as a 144 × 4 matrix;
5. Similarly, Ψ2 is a 96 × 96 matrix and combined with the dimensions of A, we have that Ψ̆2
is a 144 × 144 matrix.
At this stage in the discussion of the Thurstonian IRT/SEM modeling approach, [30] goes into an
explanation of some more technical details not relevant to the current discussion. The intent in this
manual is to outline the approach and to provide results when appropriate, not present a complete
discussion of topics that are beyond the scope of this document. The interested reader may consult
the section on “Identification of Thurstonian IRT Models for Force-Choice Questionnaires” in [30],
among other references. Similarly, we have followed the information in [124] for the implementation
and estimation of this model to all TTI Success Insights forced-choice format assessment data.
The approach relies on an MPlus (Muethen & Muethen) SEM code generator provided by the
authors of [124] in a Microsoft Excel macro. The output is then analyzed using the R Statistical
Package in multiple ways. In many cases the software package Matlab is used to generate graph-
ics and scoring algorithms. Similarly, the software package Maple is also used for generation of
graphical output based on the results of the model. The next section describes some of the more
important output statistics and graphical displays that may be generated based on such output.
Based on the assumption that each item measures only a single trait, we may then consider the
item characteristic function (ICF) to be a function based only on comparing item i measuring trait
ηa to item k measuring ηb . We may thus simplify the result of (12.19) to
−γ + λ η − λ η
P r (yl = 1|ηa , ηb ) = Φ
l
q i a k b
. (12.20)
2 2
ψi + ψk
The estimation process described in [124] provides estimations of all parameters in (12.20) and we
thus have a relatively simple version of the item characteristic function for the comparison of any
two items (in a given frame) for a forced-choice assessment format in the form of a standard normal
distribution (as a function of the two latent variables, ηa and ηb ). This allows the researcher to
display the item characteristic functions as 2-dimensional surfaces embedded in R3 .
Figure 12.1: Generic Item Characteristic Surface for Thurstonian IRT Model
Figure 12.2: Generic Item Information Surface for Thurstonian IRT Model
119
Behaviors Technical Manual Criterion Validity Studies
If a test score is interpreted for a given use in a way that has not been validated, it
is incumbent on the user to justify the new interpretation for that use, providing a
rationale and collecting new evidence, if necessary.
There are at least two major takeaways from the previous statement. The first is more implicit in
that any use of assessment data in a way that has not been validated requires validation. Second
and more explicit is that the responsibility for providing evidence of such validation lies primarily
with the test user.
One of the purposes of this paper is to begin to establish a foundation of validity evidence for the TTI
Success Insights Talent Insights R assessment by combining demographic information, specifically
job description, with a classification algorithm. In this case, the authors choose to use a logistic
regression approach to assigning membership in a group. One may consult Section 3.3 above for a
more detailed discussion presenting some of the reasons for our choice.
In a previous study, [70] shows that the TTI Trimetrix R DNA Legacy variables may be used in
conjunction with a logistic regression classification approach to successfully differentiate serial en-
trepreneurs from a group randomly selected from the TTI database. This study is an example
demonstrating the relationship between TTI assessment derived variables (Behaviors, Motivators,
and DNA) with other variables (membership in a specific population within a larger population).
In April, 2018, TTI began offering the opportunity for our network of Value Added Associates
(VAA) to add demographics collection to their assessment links. Since that time a good deal of
demographic information has been collected and analyzed in various ways. The current study under
consideration asks whether the use of a classification technique in a manner similar to the previously
discussed paper, see [70], may be applied to subsets of the demographic data. In particular, this
study considers several subsets defined through the use of the “Job Description” demographic
category in the TTI demographics database.
This paper is organized as follows. The next two sections discuss some details of the data sets
under consideration including an internal consistency presentation. This is followed by a section
briefly discussing some of the details of logistic regression. A section discussing measurement of
the success or failure of the model is presented next, followed by a section presenting the results of
the study on several data sets. Finally, a summary and future work section wraps up the paper.
120
Behaviors Technical Manual Criterion Validity Studies
is called Motivation Insights and is based on a six factor model and is used to measure the inher-
ent motivation behind an individual’s actions. The scales measured are Theoretical, Utilitarian,
Aesthetic, Social, Individualistic, and Traditional.
An additional set of variables is considered during the analysis below. There are a set of 12 variables
derived from the DISC variables of the Style Insights assessment. They are jointly called the
Behavioral Hierarchy. This set contains the variables Urgency, Frequent Interaction with Others,
Organized Workplace, Analysis of Data, Competitiveness, Versatility, People Oriented, Frequent
Change, Customer Relations, Follow Up and Follow Through, and Following Policy.
Additionally, Motivation Insights generates a total of 12 separate scores that are collectively known
as Driving Forces. In the section containing the results below, the regressions consider the 12
Driving Forces R variables while analysis of the scales focuses on the 6 scales mentioned above.
There is no inconsistency in this approach. All information is generated by the same assessment
taker responses.
All data considered in this study has been collected between April 2, 2018 and September 14,
2018. The data sets are pulled from the TTI Internet Delivery Service R databases. A data set is
generated for each of the two assessments, along with a combined data set generated by individuals
taking the Talent Insights combined assessment.
Each of the three data sets comes with a series of demographic data categories that may or may
not be utilized in any given instance. For those data entries that contain non-empty demographic
information, a brief study of categorical breakdowns reveals a series of subsets of the larger data
sets that may be individually studied. Of primary interest for this study is the Job Description
category which contains a list of job descriptions as defined by O*NET. For more information
on O*NET one may consult the United States Department of Labor Employment and Training
Administration O*NET website at https://www.doleta.gov/programs/onet/eta_default.cfm.
Several job classifications are clearly chosen more frequently than others making them particularly
important to TTI Success Insights given their higher frequency in the current data sets.
Title Frequency
Sales Rep, Services, All Other 579
Accountants 270
Managers, All Other 265
Sales and Related, All Other 231
CEO 208
The highest frequency job descriptions chosen in the combined Talent Insights data set are sum-
marized in Table 13.1.
Ideally, it is preferable to have larger subsets of data to study. However, it is the opinion of these
authors that the frequencies presented in Table 13.1 are large enough to present an initial look
at these categories with an eye toward the future when much more data becomes available. The
purpose of this initial look is to establish relationships between TTI Success Insights assessment
variables and external variables, specifically Job Description. Future studies are planned for such
a time when more data are available and, preferably, access to other criteria such as a performance
metric are also available.
121
Behaviors Technical Manual Criterion Validity Studies
The data sets used in the study are then formed by taking the set of information of interest and
combining them with a randomly generated sample of an appropriate size. For this study, each
set of interest is combined with a random sample of size 1,000. No selection criteria are used in
addition to the requirement that the random subset comes from the larger Talent Insights data set
and that the random subset does not contain any individuals from the population of interest.
We shall use the naming convention as outlined in Table 13.2.
Table 13.2: Naming Convention
Before discussing some of the statistics related to measures of evidence of reliability and validity,
we present some basic information obtained from the demographics for completeness.
Table 13.3: Basic Information Sales1
Table 13.3 presents the Sales1 category and shows a set weighted toward males with approximately
69% of the assessment data coming from males. Additionally, the Ethnicity breakdown shows 472
Caucasians with 28 African Americans coming in as the second largest group in this data set. In the
Random data set, 767 respondents chose Caucasian with African Americans coming in as second
largest at 66.
Table 13.4: Basic Information Accountants
Table 13.4 presents the Accountant category and shows a set weighted toward females with approx-
imately 56% of the assessment data coming from women. Additionally, the Ethnicity breakdown
shows 208 Caucasians with 15 Hispanic or Latino coming in as the second largest group in this data
set. In the Random data set, 803 respondents chose Caucasian with African Americans coming in
as second largest at 55.
Table 13.5 presents the Manager category and shows a set weighted toward males with approxi-
mately 60% of the assessment data coming from men. Additionally, the Ethnicity breakdown shows
203 Caucasians with 20 Hispanic or Latino coming in as the second largest group in this data set.
In the Random data set, 774 respondents chose Caucasian with African Americans coming in as
second largest at 66.
122
Behaviors Technical Manual Criterion Validity Studies
As is seen in Table 13.6, the CEO category is quite heavily weighted toward males with approx-
imately 77% of the assessment data coming from males. Additionally, the Ethnicity breakdown
shows 184 Caucasians with 11 African Americans coming in as the second largest group in this data
set. In the Random data set, 767 respondents chose Caucasian with African Americans coming in
as second largest at 69.
As is seen in Table 13.7, the Sales2 category is heavily weighted toward males with approximately
63% of the assessment data coming from males. Additionally, the Ethnicity breakdown shows 191
Caucasians with 12 African Americans coming in as the second largest group in this data set. In the
Random data set, 784 respondents chose Caucasian with African Americans coming in as second
largest at 66.
and 0.15 to 0.50 for more encompassing scale. Generally speaking, the desired range for corrected
item-total correlation is 0.30 to 0.50. The reader should note that these ranges are guidelines, not
hard cutoff points.
Whether a measure is acceptable or not is often a matter of interpretation of the scale, its intended
purpose, and many other variables. A general rule of thumb is that items with very low item
correlation provide little information and those with too high become redundant. The respective
averages provide a summation of the overall effectiveness of the scale.
Range Interpretation
α ≥ 0.90 Shorten Scale
0.80 ≤ α < 0.90 Very Good
0.70 ≤ α < 0.80 Respectable
0.65 ≤ α < 0.70 Undesirable
α < 0.65 Unacceptable
Finally, there is a more sound foundation for interpreting the internal consistency measure, although
there are differing opinions in this arena as well. However, for the purposes of this study, the authors
use the guidelines presented in Table 13.8.
Table 13.9 shows the Style Insights internal consistency and correlations for the Sales1 data set. The
computations are performed on the target population. All internal consistency measures score in
the Very Good range. The average inter-item correlations are all acceptable according to the larger
124
Behaviors Technical Manual Criterion Validity Studies
range interpretation of 0.15 to 0.50, and the corrected item-total correlations fall nicely within a
good range. Table 13.10 displays the same information as Table 13.9 on the Motivation Insights
assessment data. In this case all data falls nicely within acceptable ranges for all three statistics
presented.
Tables 13.11 and 13.12 present the results for the Accountant data set. All variables from the Style
Insights show an acceptable level of consistency and correlation. The possible lone exception is
the C variable shows a borderline average inter item correlation. The Motivation Insights variables
show a solid performance on all metrics at hand, although two variables show corrected item total
correlation that may be slightly on the high side.
The data in Table 13.13 shows an average inter item correlation for the C scale in the Managers
data set for the Style Insights variables that is borderline low.
The remainder of the variables show solid scores for all variables and the C variable is in the Very
Good range for the internal consistency measure. Table 13.14 shows good average correlations
scores for both inter-item and corrected item-total correlation. Four of the six scales fall just under
the Very Good range, all at the 0.79 level on the internal consistency metric.
125
Behaviors Technical Manual Criterion Validity Studies
Tables 13.15 and 13.16 present the information for the Sales2 data set. These tables show arguably
the most consistent data across all variables and metrics of interest of the data considered.
The final data set containing the data on Chief Executives is the other data set in the conversation
for most consistently across all variables and metrics. The data from both data sets and all metrics
considered show the highest levels of inter-item correlation for the D and I scales along with solid
consistency across all other metrics for these variables and all metrics for the remaining variables.
126
Behaviors Technical Manual Criterion Validity Studies
exp (β0 + β1 x)
π(x) = (13.1)
1 + exp (β0 + β1 x)
where the notation exp denotes the usual exponential function with base e. The function π : D 7→
[0, 1] where D is an appropriate domain dependent on the explanatory variable x and [0, 1] is the
usual unit interval in R.
According to [89], there are two main reasons for choosing the logistic distribution in (13.1). First,
π is an extremely flexible and easily used function, and second, π lends itself to meaningful (clinical)
interpretation. To see the utility of the function π note the following transformation, called the
logit transformation.
π(x)
g(x) = ln . (13.2)
1 − π(x)
Note that with a little algebra, g(x) = β0 + β1 x. This is useful in that the logit transformation of
the logistic regression equation results in a linear expression with many of the desirable properties
of the usual linear regression model.
One important difference between linear and logistic regression is that the error, which expresses an
observation’s deviation from the conditional mean, is no longer assumed to be normally distributed.
Again following [89], we may express the value of the outcome variable given x as y = π(x) + .
In this formulation, may take on one of two possible values. If y = 1, then = 1 − π(x) with
probability π(x), and if y = 0 then = −π(x) with probability 1 − π(x). In summary, follows a
binomial distribution with probability given by the conditional mean π(x).
The importance of the preceding discussion is that we can now readily construct the likelihood
function of the above mentioned binomial distribution. For values of y = 1 given x the contribution
to the likelihood function is π(x) and the contribution for values of y = 0 given x the contribution
is 1 − π(x). Thus, for any observation xi , the contribution to the likelihood function is given by
127
Behaviors Technical Manual Criterion Validity Studies
There is one more step involved to obtain the desired result. In all parametric regression approaches,
there is an underlying optimization. This usually entails some form of differentiation. In the case
at hand, (13.4) now requires differentiation with respect to the parameters β and a solution of the
resulting equations. However, differentiation of products of functions is quite difficult compared
to differentiation of sums of functions. This leads to a heavy computational cost. Hence, it is
advantageous to construct the log-likelihood function by taking the logarithm of (13.4) and using
the appropriate properties of the logarithmic functions, namely that ln(f · g) = ln(f ) + ln(g) and
ln(f g ) = g ln(f ). Hence,
n
X
L(β) = {yi ln(π(xi ))
i=1 (13.5)
+ (1 − yi )(ln(1 − π(xi )))}.
exp(g2 (x))
P (y = 2|x) = , (13.8)
1 + exp(g1 (x)) + exp(g2 (x))
where
128
Behaviors Technical Manual Criterion Validity Studies
There is a similar derivation of the log likelihood function to that in (13.5) and a maximum likelihood
estimation process is used to find the coefficients (βij ).
The utility of the multinomial logistic regression technique is for a case similar to predicting the
likelihood of a student with a given set of characteristics to pass a given course with a particular
grade level. This process could also be useful in constructing a predictive model that would rank
a group of sales employees into two categories, one category representing high performers and the
other category representing low to average performers. The third category may be a random sample
of the general population for differentiation purposes.
129
Behaviors Technical Manual Criterion Validity Studies
Prediction outcome
p n
True False
Y
Positive Positive
actual
value
False True
N
Negative Negative
total P0 N0
of which are directly used in this paper and others are for information only.
First we define the true positive rate and the false positive rate. The true positive rate is the ratio
of correctly identified positives to the total number of actual positives. In terms of the confusion
matrix we have
TP
TPR = . (13.10)
P0
Similarly, the false positive rate is the total number of false positives divided by the total number
of negatives.
FP
FPR = . (13.11)
N0
Some quantities of interest may be accuracy, the total correct classifications divided by the sample
size:
TP + TN
Acc = , (13.12)
P0 + N0
precision, the true positives divided by the total number of positives, true and false:
TP
P re = , (13.13)
TP + FP
and a measure of overall fit called the F-measure
2
F = 1 1 . (13.14)
P re + Acc
Going back to the example mentioned before, we have a group of 100 total, and 25 we are interested
in identifying using some method. Suppose we have the following confusion matrix:
17 12
8 63
12
FPR = = 0.16, (13.16)
75
17 + 63
Acc = = 0.80, (13.17)
100
17
P re = = 0.59, (13.18)
17 + 12
and
2
F = 1 1 = .68. (13.19)
.59 + .8
We note that the above example is for a single paring of T P R with F P R. Each of these rates
has been computed for a single classification point, the examples I gave were 0.25 and 0.50 for the
fictitious data in Table 13.19. The previous quantities are not based on a real classification model.
To move to the concept of the ROC curve, we consider not just a single point, but all possible values
we could use as a judgment of how well the model performs at each possible value in the interval
[0, 1]. Since there are infinitely many possibilities, we take a reasonably sized discrete sample and
compute a confusion matrix for each value we choose.
For each confusion matrix, we may compute any or all of the aforementioned quantities. However,
we are particularly interested in the relationship between the true positive rate and the false positive
rate. We consider how the functional relationship behaves, TPR as a function of FPR by considering
these as ordered pairs of data (F P Ri , T P Ri ), i = 1, . . . , n, in the unit square in R2 .
Visualization of this data is done by plotting the ordered pairs of data with the F P Ri values along
the x-axis and the T P Ri values on the y-axis. This plot is known as the ROC curve. Figure 13.2
131
Behaviors Technical Manual Criterion Validity Studies
shows an example plot. The line separating the two shades of blue is the line y = x and represents
the ROC curve of a random selection process. In other words, if we randomly classified individuals
the line y = x represents the relationship between the FPR and TPR values.
Values of an ROC curve that lie above the line y = x denote a classification algorithm that
is superior to a random selection process with the reverse holding for points below the line. The
example in Figure 13.2 shows a solidly performing algorithm. It is desirable to have a quantification
of how much better the algorithm performs than the random approach. This value is given to us
by the area under the curve (AUC), meaning the area under the classification ROC curve.
The line y = x forms a right triangle with the x-axis and the vertical line x = 1. The area under
this curve on the interval [0, 1] is thus 0.50. If we can show the area under the ROC curve of
interest is greater than or equal to 0.50, we have shown that the classification algorithm in question
performs better than the random selection approach.
The solution to this problem is to integrate the function defining the ROC curve. In other words,
Z1
AU C = f (x)dx (13.20)
0
n
X
P C1 = w(1)i Xi (13.22)
i=1
where the weights w(1)i are chosen to maximize the ratio of the variance of P C1 to the total variation
subject to the following constraint
n
X
w(1)i = 1. (13.23)
i=1
We then compute the second new vector P C2 which is now orthogonal to P C1 which accounts for
the maximum amount of the remaining total variation not accounted for by P C1 . This process
continues until we have computed n new vectors that form an orthogonal basis for the space spanned
by the original set of n explanatory variables. At this point, the art in the phrase “art and science”
comes into play.
We choose the first m new variables P C1 , . . . , P Cm such that as much variation as possible is
accounted for in the following sense. When adding the next variable to the set to be considered no
longer adds a significant amount of variation accounted for, that next variable is not adding to the
overall information to be considered. Exactly what the cutoff should be is a matter of some trial
and error to see what works best in a given situation.
One positive takeaway from the PCA approach is that we now have a set of orthogonal variables
and the probability of overfitting the model is much smaller. A negative of this approach is that
the interpretation of the new variables is not necessarily straightforward.
As noted in a previous section, the ROC analysis is concerned with determining the relationship
between the true positive rate (TPR) and the false positive rate (FPR). Further, we wish to
determine whether the information taken from the assessment variables generates identification
rates better than a random sampling technique.
This may be done in at least two ways. The first is to visually inspect the ROC curve and its
relationship to the line y = x in the plane. The second is to compute the area under the ROC
134
Behaviors Technical Manual Criterion Validity Studies
curve and compare that to the area under the curve generated by the line y = x on the interval
[0, 1]. Recall that this area is equal to 0.50. Examples of both methods are presented in a previous
section of this paper. Additionally, many statistics are available to be compared across data sets
and to help determine additional information about the quality of the fit. Given that this paper
is interested in determining whether relationships exist between variables of interest, as opposed
to attempting to generate a fully predictive model and testing its accuracy, the authors choose to
present the Brier statistic, see, e.g., [138].
First presented is the area under the curve (AUC) statistic. Table B.141 shows the results of the
analysis applied to the CEO data set. The results are fairly consistent across the data subsets with
Behaviors and All DISC (Behaviors with Behavioral Hierarchy included) scoring nearly identically,
and showing a slight advantage over the scoring of the Driving Forces data subset. When considering
the full Talent Insights data set, we see an increase in AUC of nearly 9% (0.75/0.69-1).
The Brier statistics show very little in the way of differences between the models, with only a slight
change in the all variables version. The No PCA version loses a 0.01 to the earlier versions, while
the PCA version gains a 0.01 advantage.
Figure 13.5 presents the ROC curve compared to the random selection technique. The curve is
smooth throughout and clearly lies above the line y = x consistent with the values of AUC being
consistently above 0.50 as shown in Table B.141.
The authors would like to draw a comparison, again, to an earlier, similar study conducted on Serial
Entrepreneurs. In this case, we draw the comparison between % Correctly Classified as presented
in [70] and the AUC values in Table B.141. This comparison is presented in Table 13.22. Note that
the column Correct ID is a real number version of the % Correctly Classified as reported in the
original work.
There are a few things to make note of. First, there is no generalization between identification
problems based solely on the variables present. In fact, [70] considered data from a TriMetrix
135
Behaviors Technical Manual Criterion Validity Studies
Legacy DNA assessment, collected in 2010, which looks at the six motivation scales compared
to the 12 Driving Forces considered in this work, for which data was collected between April
and Spetember, 2018. Similarly, this work does not differentiate between Adapted and Natural
Behaviors while [70] considers them separately. The AUC scores presented in Table 13.22 are based
on the PCA column of Table B.141.
What the authors found striking were the similarities between the performance of a very simi-
lar modeling approach (logistic regression) on two separate and assumed distinct data sets. To
elaborate, both groups of individuals are self-identified as Serial Entrepreneurs or CEO. Serial En-
trepreneur is defined as one who has started more than one business with no reference to the success
or failure of any business started by the individual. CEO is an abbreviation of a job description in
the O*Net classification and is used to denote Chief Executive.
While one may expect some similarities to exist, there is by no means any requirement for any
specific qualities to be adhered to by any individual starting a business or by anyone advancing
through the ranks of a company to attain a C-level position.
136
Behaviors Technical Manual Criterion Validity Studies
When comparing the results presented in Table 13.22, it is interesting to see that the different sub-
sets considered perform so similarly. Additionally, the hierarchy is generally respected in that each
individual subcomponent of Talent Insights performs approximately the same across the subsets
and across the two different studies, and combining the data together provides a better result. This
final piece of information is critical.
A conclusion of [70] is that additional assessments provide additional information. The results
of the current study appear to support that conclusion and support continued evidence of the
relationship between Talent Insights variables and external variables, in this case Chief Executives
as self-reported and based on the O*Net job classification model.
As a final note, one may question the comparison of the AUC to the % Correctly Identified. To
address this, note the % Correctly Classified, as presented in [70], is based on the Confusion Matrix
as defined earlier in this paper. For any specific value x ∈ [0, 1] and a probability model, one
may define a Confusion Matrix. For the results in [70], the assumption was made that the ratio of
membership to non-membership in the set of interest was known and used to generate the Confusion
Matrix (50%).
The ROC analysis presented in the current study are based on using a discrete approximation of the
ROC curve, which is the functional relationship between the true positive rate and the false positive
rate. The AUC statistic is based on a discrete approximation of the ROC curve and generates an
approximation of the area under the true ROC curve. To be explicit, denote by f : R 7→ R any
continuous function defined on the real line. Then,
Zb
1
f (x)dx (13.24)
b−a
a
defines the average value of f over the interval [a, b] ⊆ R. If we let f be a function representing an
ROC curve, we then have
Z1 Z1
1
AU C = f (x)dx = f (x)dx. (13.25)
1−0
0 0
Given that the % Correctly Identified from [70] is simply a point on the ROC curve, we can conclude
that the AUC values discussed in this paper and the classification percentages from [70] may be
compared as presented above.
with the PCA variable reduction technique, shows strong improvement over using either DISC or
12 Driving Forces variables alone.
Third, the results in this paper confirm earlier results presented in [70] and highlight that the Talent
Insights assessment may be successfully and statistically significantly used to classify the groups
mentioned in the earlier study as well as the five new groups considered in the current work.
Future studies are scheduled to begin as data becomes available. As demographic data collection
grows, the number and scope of studies such as this one are anticipated to grow dramatically.
The AUC statistics for Sales1 are not as strong across all variables considered when compared to
the CEO sample. However, the results do show the same pattern as the CEO sample and that of
earlier work in [70]. All variables show performance that well exceeds that of random sampling
138
Behaviors Technical Manual Criterion Validity Studies
and combining the variables from both DISC and the 12 Driving Forces outperforms using either
assessment individually.
Similarly, the Brier statistics are all at an acceptable level, although not as strong as the CEO
sample. The visualization of the AUC comparison to a random sampling approach is displayed in
Figure 13.6. The main variables of interest from DISC are Natural Compliance with a negative
coefficient and Versatility from the Behavior Hierarchy. From Driving Forces, we see Selfless with a
negative coefficient. The three aforementioned variables form the best fit model for Talent Insights
and Sales1.
Tables 13.25 and 13.26 along with Figure 13.7 show the results for the data set called Accountant.
139
Behaviors Technical Manual Criterion Validity Studies
The results in Table 13.25 show that the DISC variables perform similarly to both the Sales1 and
CEO samples, while there is some drop off in the performance of the 12 Driving Forces in this
sample. However, all samples outperform the random sampling approach, and the combination of
DISC and 12 Driving Forces performs best.
The Brier statistics shown in Table 13.26 are similar to those in the CEO sample, showing slightly
better results than the Sales1 sample. The AUC representation is shown in Figure 13.7. The Ac-
countant sample has Adapted Steadiness and Compliance along with Following Policy, all strongly
positively related to the data. Driving Forces has Intellectual and Commanding, with a negative
coefficient, come through. The final model for all Talent Insights with no PCA has Following Policy,
Commanding with negative coefficient, and Objective come through as the best fit model.
Tables 13.27 and 13.28 along with Figure 13.8 show the results for the data set called Manager. The
results in Table 13.27 show that the DISC variables perform similarly to the Serial Entrepreneur
version of either Adapter or Natural DISC scales with some drop off in the performance of the 12
Driving Forces in this sample compared to the three earlier samples in this work. However, all
samples outperform the random sampling approach and the combination of DISC and 12 Driving
Forces performs best once again.
The Brier statistics shown in Table 13.28 are similar to those in the CEO sample, showing slightly
better results than the Sales1 sample.
140
Behaviors Technical Manual Criterion Validity Studies
The AUC representation is shown in Figure 13.8. The results of the regression analysis on this
particular data set struggled to identify across the population with all sets of variables. For ex-
ample, DISC only regression showed Natural Dominance come across as statistically significant.
For Driving Forces, Commanding comes through statistically significant. However, there is little
improvement shown when combining the data into a Talent Insights combined regression.
When combining all variables in the PCA approach, we see a very limited improvement in the
overall output. The conjecture we have is the job classification Manager is quite vague and can run
a quite wide spectrum of the population. A manager could be anything from a teen-age assistant
manager at a fast food chain restaurant to a middle-aged individual running a convenience store
and so on. The possibilities are many, and we believe that this is an obstacle to such a generically
defined subset being reasonably identified. Having said that, we still show better than random
identification with this model.
Tables 13.29 and 13.30 along with Figure 13.9 show the results for the data set called Sales2.
141
Behaviors Technical Manual Criterion Validity Studies
The results in Table 13.29 show that the DISC variables in all samples outperform the random
sampling approach, and the combination of DISC and 12 Driving Forces performs best.
The Brier statistics shown in Table 13.30 are similar to those in the CEO sample, showing slightly
better results than the Sales1 sample with the exception of the All Variables with no PCA category.
The AUC representation is shown in Figure 13.9.
The main takeaway from the information presented in this section is that all data sets for all subsets
of variables show two main results. First, in all cases, all subsets of variables have statistically
significant logistic regression fits that outperform a random sampling technique as shown by the
ROC curves and the AUC and Brier statistics. Second, in all cases, the combination of both DISC
variables and 12 Driving Forces variables consistently outperforms using either set of assessment
variables alone.
142
Behaviors Technical Manual Criterion Validity Studies
In some cases the performance is not dramatically improved. For example, the Sales2 data set
shows very similar AUC for 12 Driving Forces and All Variables. The difference is seen in selecting
the PCA version over the No PCA version and basing this decision on the improved Brier statistic.
In the Accountant, CEO, and Sales1 data sets, the improvement is more noticeable. Similarly, in
the Manager data set using the PCA approach shows stronger improvement.
143
Behaviors Technical Manual Response Process Validity
questions are asked again while connected to the EEG. The EEG measuring frontal lobe
responses in terms of gamma waves is compared with the assessment questions. Positive
responses proved one frontal lobe response in terms of gamma waves, negative or false
answers provide a different gamma response, and neutral questions provide a neutral
gamma response. Reading the responses then tells whether the respondent initially
responded with integrity. If so, the assessment is validated.
Detailing all protocols used for our internal response validation process is beyond the scope of this
manual. The interested reader may consult any of [21], [22], [23], [26], [27], [46], [47], or [51].
In general, the Gamma for Ipsative Validation using Electroencephalography (GIVE) process ac-
cesses asymmetric gamma wave bursts in the prefrontal cortex to validate the underlying subcon-
scious decisions behind these self-report responses at the very moment of decision-making. As
stated in the patent, until now no process has linked these specific types of self-reports to ac-
tual brain activity. The new process uses asymmetric wave analysis resulting from stimulus to
validate the underlying mental decisions behind these reported responses, at the very moment of
decision-making, thus exposing the true thoughts behind the responses and documenting potential
abnormalities between their pre-assessments and their actual brain activity. This process provides
evidence that an evoked emotionally laden response results in corresponding brain activity and
documents both the intensity of human emotional responses and the directionality of the response.
Figure 14.1 presents an example of frontal lobe gamma asymmetry with positive, neutral, and
negative responses. The orientation of the brain is facing forward such that the right hemisphere
is on the left side of the image. The red indicates an increase in gamma activity, the blue indicates
a decrease in activity, and the green indicates little or no activation. In addition to color depicting
a range of response intensity, the gamma burst location is also key to interpretation. A left frontal
lobe flare is an indication of acceptance or a positive response and a right-side flare demonstrates
avoidance or a negative response to a stimuli.
At the time of publication of this manual, due to the 2020 COVID-19 pandemic, the TTI Success
Insights EEG lab was shut down, thus limiting our EEG response data for the Style Insights
assessment. However, protocols are in place to measure not only the direction of the brain response,
but also the intensity of specific items. To accomplish this, each forced-choice frame is modified
into four Likert-style response items for participants to make a selection. The existence of these
protocols allow the TTI Success Insights research team to collect such data and make comparisons
of the brain findings with the respondent’s original Likert-style survey responses, as well as an
item-by-item comparison to our mathematical findings.
145
Behaviors Technical Manual Response Process Validity
This same protocol has already been employed on our Emotional Quotient assessment and the pilot
results are to be published in a forthcoming article. The study protocols are published in [26]. The
initial findings from our EQ EEG study found five emerging groupings of items that when combined
with mathematical data are lighting the way to continual improvement. Each study participant
responded to a subset of items from the TTI Success Insights EQ assessment on a five-point Likert-
style scale. The Likert-style scale is anchored with 1 - Strongly Disagree, 2 - Disagree, 3 - Neither
Agree nor Disagree, 4 - Agree, and 5 - Strongly Agree. The images that follow are captured at
one-eighth second intervals.
Figures 14.2 through 14.6 provide examples of each of the five groupings. In the first, we find an item
that has a match between the Likert-style response and their mental processing, which documents
a dominant left frontal lobe gamma activity and a match with their agreement rating. The second
example is quite different. While the respondent agrees strongly with the item, their brain is slower
to form a response, and, when it does, it shows a right side or aversion to the concept. This suggests
that the item may have a socially acceptable answer that is arrived at after contemplation by the
brain, but results in a very different answer. Both the answer provided and the brain activity in
the third example suggest confusion or a lack of discrimination. Example four is a good example
of what happens in the brain when confronted with double negative statements. While assessment
designers once thought these mental manipulations kept respondents on their toes, this research
suggests they are simply difficult to mentally process. The final example is interesting in that it
provides very little brain activity suggesting that the participant is not emotionally associating
with the item and, therefore, may not be providing additional factor information.
146
Behaviors Technical Manual Response Process Validity
147
Behaviors Technical Manual Consequential Validity
15.1 The Intended Purpose of TTI Success Insights’ Style Insights Assessment
TTI Success Insights DISC Universal Language Reference Manual states:
We live in a laboratory where we have the unique opportunity to learn about people.
By developing a strong command of the DISC language, you will:
By exposing “how” individuals interact, Style Insights may also assist employers in the selection
process. When hiring agents of an organization understand the unique behavioral characteristics
of each job, they can make better hiring decisions. When managers understand the behavioral
styles of each employee, they can tailor their management style and re-frame concepts in ways that
will better communicate to each employee. Therefore, Style Insights not only allows for greater
self-awareness regarding our own behaviors, but also creates opportunities for better job fits.
Ensuring that unintended consequences are evaluated is the responsibility of those mak-
ing the decision whether to use a particular test, ... (pp. 31)
148
Behaviors Technical Manual Consequential Validity
149
Behaviors Technical Manual Summary
150
Behaviors Technical Manual Summary
consider this to be solid evidence of temporal consistency. Our most recent test-retest study prior
to the current study presented in this work is documented in [24].
151
Behaviors Technical Manual Summary
We have noted on several occasions in this manuscript that several of the tools are not directly
applicable to our assessment data due to its forced-rank response format. We have argued that
while the methods may not be completely directly applicable to our data, they may be applied
to the individual scales, noting that the interpretation of the results is difficult. With this caveat
noted, we use exploratory factor analysis (EFA), item response theory (IRT), and multivariable
item response theory (MIRT) to aid our diagnostic approaches to assessment development and
improvement.
We do this in two ways. First, we analyze our assessment data by extracting the individual scale
responses from the forced-rank data. This is the part that the academics disapprove. We again
note that we do so for diagnostic purposes with an eye on the future. The second approach is
to utilize a data gathering system developed internal to TTI Success Insights allowing us to test
individual items. We gather the individual item data on a likert-style scale whose data is useful in
many of these statistical approaches.
Our approaches to EFA, IRT, and MIRT are presented in Sections 10, 11, and 12, respectively.
EFA and IRT are applied to both the forced-rank data for diagnostic purposes and the likert-style
data for testing and improvement purposes. The MIRT is currently in the form of a Thurstonian
IRT approach as outlined in [30]. This implementation is in its early stages, but preliminary results
are promising, albeit on small subsets of the overall assessment.
152
Behaviors Technical Manual Summary
The output is then analyzed by the neurology team and then discussed with the mathemati-
cal/analytic team where we compare notes. The protocols for this process have been published in
a paper in late 2018, see [26]. In this paper we describe several potential response patterns that
emerge and then we attempt to tie these results into the ideas of item difficulty and discrimination.
Some examples of this joint analysis come from items that the study population generally shows
agreement. In this setting, agreement is interpreted to mean that the brain images show agreement
with the study participants’ responses to an item or frame. In this case we expect to see relatively
solid discrimination parameters in an IRT model, or reasonable levels of item-total correlation in
the non-IRT cases, for a given item.
In other cases we see items that are intended to be reverse scored or keyed in which brain imaging
shows confusion while mathematically we see low discrimination scores and potentially oddly dis-
tributed population response distributions. Overall, the idea is to use the mathematical analysis
to identify the good (and not so good), while using the EEG generated brain images to identify
any causes for potentially under-performing items. The combination is proving to be a quite solid
approach to identifying information about our assessments and the individuals responding to our
assessments.
As a final note, TTI Success Insights is quite proud of our merging of neurology and mathematical
analysis for at least two reasons. The first is that we are unaware of any other assessment company
in the market who are conducting neurological based studies. We believe that the analysis being
conducted at TTI Success Insights is truly an industry leader at this time. The second reason is
that according to the APA,
Gathering this type of evidence is difficult because one cannot directly observe the
cognitive processes going on within people’s heads as they respond to test items. (pp.
76, [156])
While it is not completely certain we are witnessing the cognitive process directly, we are gathering
information on how the individual’s brain activity is changing before, during, and after seeing and
responding to an assessment item.
154
Behaviors Technical Manual Summary
TM
essentially the same study on the Style Insights assessment. We are quite excited to get back
into the lab, get these studies kicked-off, and see what knowledge we may gain.
There are three main areas of focus for the mathematical side of our research team: Exploratory
Factor Analysis (EFA), Item Response Theory (IRT), and multivariable IRT. It has been noted in
earlier in this manuscript that EFA and IRT are not directly applicable to the assessment format
at hand. The forced-rank format generates a covariance matrix structure that, by definition, is
singular. A singular matrix is not invertible, hence any attempt to model a forced-rank format
assessment is useless. It is possible to separate the constructs from each other in the data, thereby
removing the aforementioned singularity. However, as noted in [94], while this is mathematically
possible, the interpretation of any results are questionable. Removing the physical connection of
the constructs in the data does not remove their impact on each other. Hence, interpretation is
difficult at best.
We have noted earlier that we do, in fact, employ these tactics in our study of our Style Insights
assessment and its respondent data. The reader may be wondering why state this after just arguing
that interpretation is difficult at best. The first part of the answer to this very relevant question
is that while interpretation is difficult, it is not impossible and much valuable information may be
gained to advance our understanding of our assessments. The second part of the answer is that we
also continually test our assessment items, both currently in use items and potential replacements,
in a likert-style format. In that case, the EFA and IRT models are completely appropriate. It is
through this combination of approaches that we anticipate gaining the most information about how
our assessments are performing.
In addition to the information gain, we also are moving towards advancing our analytic method-
ologies for all our assessments from the mostly archaic classical test theory approach to the more
modern item response theory approach. In this way, we are attempting to modernize both our
understanding of our assessment reliability, in the form of item and test information functions, and
internal structure validity, via the item characteristic curves and their associated item discrimina-
tion parameters. Clearly this only holds in the likert-style format. Once again, we argue that solid
information may be obtained in this format and employed in the forced-rank format to strengthen
the forced-rank based assessments.
We also intend to advance our approach to the forced-rank format assessments in our suite by em-
ploying multivariable IRT models, such as the Thurtonian approach as outlined in [30], among many
others. According to this approach, one may use a combination of Thurstone’s Law of Comparative
Judgment, confirmatory factor analysis, and item response theory to generate a multivariable IRT
model for the forced-rank assessment data. While to date we have not generated a full Thurstonian
IRT model for our Style Insights assessment, early results based on small subsets of the assessment
are promising. As pointed out in [94], one of the criticisms of forced-rank data is that they are not
applicable for interpersonal comparisons. We plan to alleviate this criticism by employing a full
multivariable IRT model in the future.
One very important area of interest deserves special mention. That area is invariance in gen-
eral, but specifically invariance of internal structure. In some settings this may be thought of as
approaching the problem of disparate impact (Civil Rights Act of 1964, Title VII), also known
more informally as adverse impact in the business community. TTI Success Insights has histori-
cally measured adverse impact on a periodic basis for most of our assessments. The current ver-
sion is hosted at https://images.ttisi.com/wp-content/uploads/2019/07/02120329/TTISI_
AdverseImpactStudy_2019.pdf. Archived versions of previous studies may be requested from the
155
Behaviors Technical Manual Summary
TTI Success Insights research team. There is a scheduled updating of the differential effects associ-
ated with several of our assessments, due to be released Summer, 2021. As with many other areas
of analysis, it is our intent to move studies of assessment invariance to a more modern analytical
approach over the next few years. For example, we are in the early stages of assessing differential
item functioning of our likert-style format assessments using graded response models.
It is a tall order to lay out all possible paths that our research may take over the next several years.
The previous paragraphs outline a handful of the directions we plan to explore. Additionally, we
plan to continue our exploration of regression modeling of our assessment variables against any
external data we may be able to obtain, including expanding our use of the demographic data
discussed in this manuscript. We also are discussing the possibility of an extended look at our
overall population and potentially relevant subsets of our population to determine if our variables
may are indicating any significant changes based on pre and post COVID-19 pandemic data. An
early look at our Style Insights variables comparing a similar time period in 2019 and 2020 did not
indicate such changes. However, hindsight shows us that our study may have been premature given
the now yearlong (and growing) series of lockdowns, stay-at-home orders, job losses, and more.
A final short discussion around an area we have long been interested shall close out this summary.
Given the vast geographic area that is the United States, many questions have come up over
the years around ideas such as whether individuals in northeastern cities (New York, Boston,
Philadelphia, etc.) are really different than individuals from rural upper midwest (Iowa, Minnesota,
or the Dakotas) or west coast cities (Los Angeles, San Francisco, Seattle). Given that individuals
living in New York City are nearly as close to London as they are to Los Angeles (a quick Google
search shows nearly identical average nonstop flight times), it may be necessary to consider whether
language usage differences in the very different geographic locations causes any issues in assessment
interpretation.
We have outlined many ways in which the TTI Success Insights team is working to continue to
produce the highest quality assessments available in the market. We employ cutting-edge, neuro-
logical research studies. We apply the latest theoretically based techniques in the mathematical
and statistical analysis of our assessments. Finally, we have a team that is working diligently to
merge the two disciplines to form a difficult to beat combined understanding of our assessments
and the interpretations of them. The future is bright at TTI Success Insights.
156
Behaviors Technical Manual References
References
[1] Caralee Adams. Report questions value of entrance exams in predicting college suc-
cess. https://blogs.edweek.org/edweek/college_bound/2014/02/report_explores_
use_of_standardize_test_scores_in_predicting_student_success.html, 2014. Educa-
tion Week.
[2] Alan Agresti. Categorical Data Analysis, Third Edition. Wiley, New York, 2012.
[3] C. Angus. How accurate are personality tests. Scientific American, 10 2018.
[4] H.L. Ansbacher. Prescott Lecky’s concept of resistance and his personality. Journal of Clinical
Psychology, 37:791–794, 1965.
[5] American Educational Research Association, American Psychological Association, and Na-
tional Council on Measurement in Education, & Joint Committee on Standards for Educa-
tional and Psychological Testing (U.S.). Standards for educational and psychological testing,
2014.
[6] Elizabeth J. Austin, Ian J. Deary, Gavin J. Gibson, Murray J. McGregor, and J. Barry Dent.
Individual response spread in self-report scales: Personality correlations and consequences.
Personality and Individual Differences, 24(3):421–438, March 1998.
[7] Frank B. Baker. Item Response Theory: Parameter Estimation Techniques. Marcel Dekker,
New York, 1992.
[8] Frank B. Baker and Seock-Ho Kim. Item Response Theory Parameter Estimation Techniques.
Marcel Dekker, New York and Basel, 2nd, revised and extended, edition, 2004.
[10] Deborah L. Bandalos. Measurement Theory and Applications for the Social Sciences. The
Guilford Press, New York, 2018.
[11] N. Barger and L. Kirby. Introduction to Type and Change. CPP, Inc., Mountain View, CA,
2004.
[12] W. L. Bedwell, S. M. Fiore, and E. Salas. Developing the 21st century (and beyond) work-
force: A review of interpersonal skills and measurements strategies. https://atecentral.
net/downloads/221/Salas_Firoe_IPS%20measurement%20Final%20draft.pdf, 2011. Pa-
per prepared for the NRC Workshop on Assessing 21st Century Skills.
[13] A. Birnbaum. Some latent trait models and their use in inferring an examinee’s ability. In
F.M. Lord and M.R. Novick, editors, Statistical Theories of Mental Test Scores. Reading,
MA: Addison-Wesley, 1968.
[14] R.D. Bock. A brief history of item response theory. Educational Measurement: Issues and
Practice, 16(4):21–33, 1997.
[15] Douglas G. Bonett and Thomas A. Wright. Cronbach’s alpha reliability: Interval estimation,
hypothesis testing, and sample size planning. Journal of Organizational Behavior, 2014.
157
Behaviors Technical Manual References
[16] B. Bonnstetter. Farmer buying style study based on observable farmstead features. TTI
Success Insights, 1979.
[17] Bill J. Bonnstetter, Ronald J. Bonnstetter, Dustin Hebets, and Tom F. Collura. Validation
process for ipsative assessments, 2016. US Patent 9,060,702.
[18] Bill J. Bonnstetter, Ronald J. Bonnstetter, Dustin Hebets, and Tom F. Collura. Validation
process for ipsative assessments, 2016. Canadian Patent No. 2,808,691.
[19] R.J. Bonnstetter. Review of the Book Altered Traits: Science Reveals How Meditation
Changes Your Mind, Brain, and Body by D. Goleman and R.J. Davidson. NeuroRegulation,
5(3):103–104, 2018.
[20] R.J. Bonnstetter. Natural Versus Adapted DISC Graphs. White Paper, TTI Success Insights,
12 2020.
[21] Ronald J. Bonnstetter and Thomas F. Collura. Brain activation imaging in emotional decision
making and mental health: A review – part 1. Clinical EEG and Neuroscience, 2020.
[22] Ronald J. Bonnstetter and Thomas F. Collura. Brain activation imaging in emotional decision
making and mental health: A review – part 2. Clinical EEG and Neuroscience, 2020.
[23] Ronald J. Bonnstetter, Thomas F. Collura, and Dustin Hebets. Uncovering the Belief Behind
the Action. NeuroConnections, 2012.
[24] Ronald J. Bonnstetter and Eric T. Gehrig. TTI Success Insights Style Insights 2016 Temporal
Consistency Report. Technical report, TTI Success Insights, 4 2016.
[25] Ronald J. Bonnstetter and Eric T. Gehrig. TTI Success Insights Style Insights 2020 Technical
Manual Version 1.0. Technical report, TTI Success Insights, 2020. Forthcoming.
[26] Ronald J. Bonnstetter, Eric T. Gehrig, and Dustin Hebets. Response Process Validation
Protocol using Neurophenomenological Gamma Asymmetry. NeuroRegulation, 2018.
[27] Ronald J. Bonnstetter, Dustin Hebets, and Nancy L. Wigton. Frontal gamma asymmetry in
response to soft skills stimuli: A pilot study. NeuroRegulation, 2(2):70–85, 2020.
[28] Robert L. Brennan. Generalizability Theory. Statistics for Social Science and Public Policy.
Springer, New York, 2001.
[29] Anna Brown. Doing less but getting more: Improving forced-choice measures with item
response theory. Assessment and Development Matters, 2(1):21–25, 2010.
[30] Anna Brown and Albert Maydeu-Olivares. Item response modeling of forced-choice question-
naires. Educational and Psychological Measurements, 71(3):460–502, 2011.
[31] Anna Brown and Albert Maydeu-Olivares. Fitting a Thurstonian IRT model to forced-choice
data. Behavioral Research Methods, 44:1135–1147, 2012.
[32] Anna Brown and Albert Maydeu-Olivares. How IRT can solve problems of ipsative data in
forced-choice questionnaires. Psychological Methods, 18:36–52, 2013.
[33] Anna Brown and Alberto Maydeu-Olivares. Ordinal factor analysis of graded-preference
questionnaire data. Structural Equation Modeling: A Multidisciplinary Journal, 25(4):516–
529, 2018.
158
Behaviors Technical Manual References
[34] W. Brown. Some experimental results in the correlation of mental abilities. British Journal
of Psychology, 3(3):296–322, 1910.
[35] R.H. Burros. The estimation of the discriminal dispersion in the method of successive inter-
vals. Psychometrika, 20:299–305, 1955.
[36] J.M. Cattel. Mental tests and measurement. Mind, 15:373–380, 1890.
[37] W.F. Chaplin and K.E. Buckner. Self-ratings of personality: A naturalistic comparison of
normative, ipsative, and idiothetic standards. Journal of Personality, 56:509–530, 1988.
[38] Eunseong Cho and Seonghoon Kim. Cronbach’s coefficient alpha: Well known but poorly
understood. Organizational Research Methods, 18(2):207–230, April 2015.
[39] Lee Anna Clark and David Watson. Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7(3):309–319, 1995.
[40] W.V. Clarke. The construction of an industrial selection personality test. Journal of Psy-
chology, 1:379–394, 1956.
[41] W.V. Clarke. The personality profiles of life insurance agents. Journal of Psychology, 42:295–
302, 1956.
[42] W.V. Clarke. The personality profiles of loan office managers. Journal of Psychology, 41:405–
412, 1956.
[43] W.V. Clarke. The personality profiles of self-made company presidents. Journal of Psychology,
41:413–418, 1956.
[44] W.V. Clarke. The personality profiles of self-made company presidents. Journal of Psychology,
41:413–418, 1956.
[45] J. Cohen. The cost of dichotomization. Applied Psychological Measurement, 7:249–253, 1983.
[46] T. F. Collura, C. Zalaquett, and R. J. Bonnstetter. Seeing inside the client’s mind. Counseling
Today, 57(6):24–27, 2014.
[48] T.F. Collura, R.J. Bonnstetter, D. Hebets, and B.J. Bonnstetter. Uncovering the Belief
Behind the Action. NeuroConnections, Winter:20–23, 2012.
[49] T.F. Collura, R.J. Bonnstetter, and C.P. Zalaquett. Neurocounseling: Bridging Brain and
Behavior. Counseling Today, 57(6):24–27, 2014.
[50] T.F. Collura, R.J. Bonnstetter, C.P. Zalaquett, and S.J. Chatters. Toward an Operational
Model of Decision Making, Emotional Regulation, and Mental Health Impact. Advances in
Mind-Body Medicine, 28(4):4–19, 2014.
[51] T.F. Collura, N.L. Wigton, C. Zalaquett, S. Chatters, and R.J. Bonnstetter. The Value
of EEG-based Electromagnetic Tomographic Analysis in Human Performance and Mental
Health. Biofeedback, 44(2), 2016.
159
Behaviors Technical Manual References
[52] Thomas F. Collura, Nancy L. Wigton, Carlos Zalaquett, SeriaShia Chatters-Smith, and
Ronald J. Bonnstetter. The Value of EEG-Based Electromagnetic Tomographic Analysis
in Human Performance and Mental Health. Biofeedback, 44(2):58–65, 2016.
[53] International Test Commision. The ITC Guidelines for Translating and Adapting Tests
(Second Edition). https://www.InTestCom.org, 2010.
[54] D.J. Cooke and C. Michie. An item response theory analysis of the hare psychopathy checklist-
revised. Psychological Assessment, 9:3–14, 1997.
[56] Lee J. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–
334, 1951.
[57] Lee J. Cronbach. My current thoughts on coefficient alpha and successor procedures. Center
for the Study of Evaluation, National Center for Research on Evaluation, Standards, and
Student Testing, Graduate School of Education, University of California, Los Angeles, 2004.
[58] Lee J. Cronbach, Goldine C. Glesser, Harinder Nanda, and Nageswari Rajaratnam. The
Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles.
John Wiley & Sons, Inc., New York, 1972.
[59] Robert F. DeVellis. Scale Development: Theory and Applications. Sage, Thousand Oaks,
CA, 4th edition, 2017.
[60] William R. Dillon and Matthew Goldstein. Multivariate Analysis: Methods and Applications.
Wiley Series in Probability and Mathematical Statistics, 1984.
[61] Thomas J. Dunn, Thom Baguley, and Vivienne Brunsden. From alpha to omega: A practi-
cal solution to the pervasive problem of internal consistency estimation. British Journal of
Psychology, 105(3):399–412, 2014.
[62] A.L. Edwards and L.L. Thurstone. An internal consistency check for scale values determined
but the method of successive intervals. Psychometrika, 17:169–180, 1955.
[63] R. Eveleth. The myers-briggs personality test is pretty much worthless. Smithsonian Maga-
zine, 3 2013.
[64] Carl F. Falk and Victoria Svalei. The relationship between unstandardized and standard-
ized alpha, true reliability, and the underlying measurement model. Journal of Personality
Assessment, 93(5):445–453, 2011.
[65] Xitao Fan, Brent C. Miller, Kyung-Eun Park, Bryan W. Winward, Mathew Christiansen,
Harold D. Grotevant, and Robert H. Tai. An exploratory study about inaccuracy and inva-
lidity in adolescent self-report surveys. Field Methods, 18(3):223–244, 2006.
[66] Tom Fawcett. An introduction to roc analysis. Pattern Recognition Letters, 27:861–874, 2006.
[67] Ronald A. Fisher. Theory of Statistical Estimation. John Wiley & Sons, New York, 1950.
Proceedings of the Cambridge Philosophical Society 22 (1925) 700-725.
160
Behaviors Technical Manual References
[68] W.P. Flannery and S.P. Reise adn K.F. Widaman. An item response theory analysis of the
general and academic scales of the Self-Description Questionnaire II. Journal of Research in
Personality, 26:168–188, 1995.
[69] R.C. Fraley, N.G. Waller, and K.A. Brennan. An item response theory analysis of self-report
measures of adult attachment. Journal of Personality and Social Psychology, 78:350–365,
2000.
[70] Eric T. Gehrig. Classification of Serial Entrepreneurs via Logistic Regression: A Case Study.
White Paper, October 2017.
[71] J.G. Geier and D.E. Downey. Energetics of Personality: Defining a Self. Aristos Publishing
House/Geier Learning Systems, United States, 1989.
[72] J.G. Geier and D.E. Downey. Personality Analysis. Aristos Publishing House, United States,
1989.
[73] Kurt F. Geisinger. Reliability. In Kurt F. Geisinger, Bruce A. Bracken, Janet F. Carlson, Jo-
Ida C. Hansen, Nathan R. Kuncel, Steven P. Reise, and Michael C. Rodrigues, editors, APA
Handbook of Testing and Assessment in Psychology Volume 1: Test Theory and Testing and
Assessment in Industrial and Organizational Psychology, chapter 2, pages 21–42. American
Psychological Association, 750 First Street NE, Washington, D.C. 20002-4242, 2013.
[74] Kurt F. Geisinger, Bruce A. Bracken, Janet F. Carlson, Jo-Ida C. Hansen, Nathan R. Kuncel,
Steven P. Reise, and Michael C. Rodrigues, editors. APA Handbook of Testing and Assessment
in Psychology Volume 1: Test Theory and Testing and Assessment in Industrial and Orga-
nizational Psychology, 750 First Street NE, Washington, D.C. 20002-4242, 2013. American
Psychological Association.
[75] Kurt F. Geisinger, Bruce A. Bracken, Janet F. Carlson, Jo-Ida C. Hansen, Nathan R. Kuncel,
Steven P. Reise, and Michael C. Rodrigues, editors. APA Handbook of Testing and Assessment
in Psychology Volume 2: Testing and Assessment in Clinical and Counseling Psychology, 750
First Street NE, Washington, D.C. 20002-4242, 2013. American Psychological Association.
[76] Kurt F. Geisinger, Bruce A. Bracken, Janet F. Carlson, Jo-Ida C. Hansen, Nathan R. Kuncel,
Steven P. Reise, and Michael C. Rodrigues, editors. APA Handbook of Testing and Assessment
in Psychology Volume 3: Testing and Assessment in School Psychology and Education, 750
First Street NE, Washington, D.C. 20002-4242, 2013. American Psychological Association.
[77] Richard L. Gorsuch. Factor Analysis. Lawrence Erlbaum Associates, 2nd edition, 1983.
[78] A. Grant. Goodbye to MBTI, the Fad That Won’t Die. Psychology Today, 9 2013.
[80] James W. Grice. Computing and evaluating factor scores. Psychological Methods, 6(4):430–
450, 2001.
[81] H. Gulliksen. Theory of Mental Test Scores. Wiley, New York, 1950.
[82] Louis Guttman. A basis for analyzing test-retest reliability. Psychometrika, 42(4):252–282,
1945.
161
Behaviors Technical Manual References
[83] Ronald K. Hambleton, Wim J. van der Linden, and Craig S. Wells. IRT Models for the Anal-
ysis of Polytomously Scored Data: Brief and Selected History of Model Building Advances. In
Micheal L. Nering and Remo Ostini, editors, Handbook of Polytomous Item Response Theory
, chapter 2, pages 21–42. Routledge: Taylor & Francis Group, 2010.
[84] Harry H. Harman. Modern Factor Analysis. University of Chicago Press, Chicago and
London, third revised edition edition, 1976.
[85] D. Harnisch. 2015 style insights reliability study. Education and Human Resources, University
of Nebraska, TTI Success Insights, 2015.
[86] Rink Hoekstra, Jorien Vugteveen, Peter M. Kruyen, and Matthijs J. Warrens. An empirical
analysis of alleged misunderstandings of coefficient alpha. International Journal of Social
Research Methodology, 22(4):351–364, 2019.
[87] R. Hogan. Personality and the Fate of Organizations. Psychology Press, New York, 2015.
[88] W.H. Holtzman and S.B. Selss. The prediction of flying success by clinical analysis of test
protocols. Journal of Abnormal Social Psychology, 49:485–490, 1954.
[89] David W. Hosmer and Stanley Lemeshow. Applied Logistic Regression. John Wiley and Sons,
Inc., 2000.
[90] C. Hoyt. Test reliability estimated by analysis of variance. Psychometrika, 6:153–160, 1941.
[91] TTI Success Insights. Disparate impact study. Unpublished white paper., 2019.
[92] R. W. B. Jackson and G. A. Ferguson. Studies on the reliability of tests. Technical report,
University of Toronto, 1941.
[94] Charles E. Johnson, Robert Wood, and S.F. Blinkhorn. Spuriouser and spuriouser: The use
of ipsative personality tests. Journal of Occupational Psychology, 61:153–162, 1988.
[95] J.T.Lamiell. Individuals and the differences between them. In R. Hogan, J. Johnson, and
S. Briggs, editors, Handbook of Personality Psychology. Academic Press, San Diego, 1997.
[97] Robert M. Kaplan and Dennis P. Saccuzzo. Psychological Testing: Principles, Applications,
and Issues. Wadsworth, Belmont, CA, 2001.
[98] E.L. Kelly and D.W. Fiske. The Prediction of Performance in Clinical Psychology. University
of Michigan Press, Ann Arbor, MI, 1951.
[99] Jae-On Kim and Charles W. Mueller. Factor analysis: Statistical methods and practical
issues. Quantitative Applications in the Social Sciences, Sage Publications, Beverly Hills and
London, 1982.
162
Behaviors Technical Manual References
[100] Jae-On Kim and Charles W. Mueller. Introduction to factor analysis: What it is and how to
do it. Quantitative Applications in the Social Sciences, Sage Publications, Beverly Hills and
London, 1982.
[101] G. F. Kuder and M. W. Richardson. The theory of the estimation of test reliability. Psy-
chometrika, 2(3):151–160, 1937.
[102] J.T. Lamiell. On the utility of looking in the “wrong” direction. Journal of Personality,
48:82–88, 1980.
[103] J.T. Lamiell. The Psychology of Personality: An Epistemological Inquiry. Columbia Univer-
sity Press, New York, 1987.
[104] D.N. Lawley. On problems connected with item selection and test construction. Proceedings
of the Royal Society of Edinburgh, Series A, 23:273–287, 1943.
[105] P.F. Lazarsfeld. The logical and mathematical foundation of latent structure analysis. In S.A.
Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star, and J.A. Clausen, editors,
Measurement and Prediction. Princeton, NJ: Princeton University Press, 1950.
[106] P. Lecky. Review of Die psychische reeingflussing des sauglings. Internationale Zeitschrift für
Indivdualpsychologie, 6:506–507, 1928.
[107] P. Lecky. The theory of self-consistency in personal problems. Paper presented at the meeting
of the American College Personnel Association, 1935.
[108] P. Lecky. Preventing failure by removing resistance. Paper presented at the meeting of the
New York Society for the Experimental Study of Education, 1938.
[109] P. Lecky. Self-consistency. The Shoe String Press, New York, 1945.
[110] P. Lecky. Self-consistency: A Theory of Personality. Island Press, New York, 2nd edition,
1951. Edited and interpreted by F.C. Thorne.
[111] P. Lecky. Self-consistency: A Theory of Personality. Island Press Cooperation, Inc., New
York, 1961. Edited and interpreted by F.C. Thorne, reprinted by Island Press Cooperation,
Inc.
[112] Jane Loevinger. The attenuation paradox in test theory. Psychological Bulletin, 51:493–504,
1954.
[113] F. M. Lord. An application of confidence intervals and of maximum likelihood to the estima-
tion of an examinee’s ability. Psychometrika, 18:57–76, 1952.
[114] F. M. Lord. The relation of test score to the trait underlying the test. Educational and
Psychological Measurement, 13:517–548, 1952.
[115] F. M. Lord. A theory of test scores. Psychometric Monograph, (No. 7), 1952.
[116] Frederic M. Lord. Applications of Item Response Theory to Practical Testing Problems.
Lawrence Erlbaum, Hillsdale, NJ, 1980.
[117] Frederic M. Lord and Melvin R. Novick. Statistical Theories of Mental Test Scores. Addison-
Wesley Publishing Company, Reading, MA, 1968.
163
Behaviors Technical Manual References
[118] Richard E. Lucas and Brendan M. Baird. Global self-assessment. In Michael Eid and Ed Di-
ener, editors, Handbook of Mutlimethod Measurement in Psychology, pages 29–42. American
Psychological Association, Washington, DC, 2006.
[120] William M. Marston. Emotions of Normal People. Kegan Paul, Trench, Trubner, & Co. Ltd.,
New York, 1928.
[121] Albert Maydeu-Olivares. Limited information estimation and testing of thurstonian models
for paried comparison data under multiple juedgment sampling. Psychometrika, 66(2):209–
228, 2001.
[122] Albert Maydeu-Olivares and Ulf Böckeenholt. Structural equation modeling of paired-
comparison and ranking data. Psychological Methods, 10(3):285–304, 2009.
[123] Albert Maydeu-Olivares and Anna Brown. Ordinal factor analysis of graded preference ques-
tionnaire data. Multivariate Behavioral Research, 45:935–974, 2010.
[124] Albert Maydeu-Olivares and Anna Brown. Fitting a Thurstonian IRT model to forced-choice
data using Mplus. Behavior Research Methods, 44:1135–1174, 2012.
[125] Roderick P. McDonald. Factor Analysis and Related Methods. Lawrence Erlbaum Associates,
1985.
[126] Roderick P. McDonald. Test Theory: A Unified Treatment. Taylor and Francis, 1999.
[127] P.F. Merenda and W.V. Clarke. Self-description and personality measurement. Journal of
Clinical Psychology, 21(1):52–56, 1965.
[128] Samuel Messik. Meaning and values in test evaluation: The science of ethics and assessment.
Educational Researcher, 18(2):5–1, 1989.
[129] Samuel Messik. Validity of psychological assessment: Validation inferences from persons’
responses and performances as scientific inquiry into score meaning. American Psychologist,
50(9):741–749, 1995.
[130] Michael L. Nering and Remo Ostini, editors. Handbook of Polytomous Item Response Theory
Models. Routledge: Taylor & Francis Group, New York and London, 2010.
[131] Jum C. Nunnally. Psychometric Theory. McGraw-Hill, New York, 2nd edition, 1978.
[132] Thomas Oakland, Ype H. Poortinga, Justin Schlegal, and Ronald K. Hambleton. Inter-
national test commission: Its history, current status, and future directions. International
Journal of Testing, 1(1):3–32, 2001.
[133] Remo Ostini and Michael L. Nering. Polytomous Item Response Theory Models. In Quanti-
tative Applications in the Social Sciences, volume 144. Sage Publications, 2006.
[134] W.J. Owen, R. Borowsky, and G.E. Sarty. Fmri of two measures of phonological processing
in visual word recognitionl. ecological validity matters. Brain and Language, 90:40–46, 2004.
doi:10.1016/S0093-934X(03)00418-8.
164
Behaviors Technical Manual References
[135] Delroy L. Paulhus and Simine Vazire. The self-report method. In Richard W. Robins,
Chris R. Fraley, and Robert F. Krueger, editors, Handbook of Research Methods in Personality
Psychology, pages 224–239. Guildford, New York, 2007.
[136] Randall D. Penfield. Item Analysis. In Kurt F. Geisinger, Bruce A. Bracken, Janet F.
Carlson, Jo-Ida C. Hansen, Nathan R. Kuncel, Steven P. Reise, and Michael C. Rodrigues,
editors, APA Handbook of Testing and Assessment in Psychology Volume 1: Test Theory and
Testing and Assessment in Industrial and Organizational Psychology, chapter 7, pages 121–
138. American Psychological Association, 750 First Street NE, Washington, D.C. 20002-4242,
2013.
[137] Gjalt-Jorn Y. Peters. The alpha and omega of scale reliability and validity. The European
Health Psychologist, 16(2):56–69, April 2016.
[138] David M. W. Powers. Evaluation: From Precision Recall, and F-Factor to ROC, Informedness,
Markedness, and Correlation. Technical report, Flinders University of South Australia, School
of Informatics and Engineering, December 2007.
[139] G. Rasch. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen:
The Danish Institute for Educational Research, 1960. G. Leunbach, Trans.
[140] G. Rasch. An individualistic approach to item analysis. In P.F. Lazarsfeld and N.W. Henry,
editors, Readings in Mathematical Social Science, pages 89–107. Cambridge: MIT Press, 1966.
[141] G. Rasch. On specific objectivity: An attempt at formalizing the request for generality and
validity of scientific statements. Danish Yearbook of Philosophy, 14:58–94, 1977.
[142] Tenko Raykov and George A. Marcoulides. Introduction to Psychometric Theory. Routledge,
New York, 2011.
[143] William Revelle and Richard E. Zinbarg. Coefficients alpha, beta, omega, and the glb:
comments on sijtsma. Psychometrika, 74(1):145–154, 2009.
[144] H.J. Rimoldi and M. Hormaeche. The law of comparative judgment in the successive intervals
and graphic rating scale methods. Psychometrika, 20:307–318, 1955.
[146] F. Samejima. Estimation of latent ability using a response pattern of graded scores. Psy-
chometrika Monograph Supplement, 17, 1969.
[147] F. Samejima. A general model for free response data. Psychometrika Monograph Supplement,
18, 1972.
[148] F. Samejima. The grade response model of the latent trait theory and tailored testing. In C.L.
Clark, editor, Proceedings of the First Conference on Computerized Adaptive Testing, pages
5–17. Washington, DC: U.S. Civil Service Commission, Personnel Research and Development
Center, 1976. (Professional Series 75-6).
[149] F. Samejima. Constant information model on the dichotomous response level. In D.J. Weiss,
editor, The 1979 Conference on Computerized Adaptive Testing, pages 145–165. Washington,
DC: U.S. Civil Service Commission, Personnel Research and Development Center, 1979.
165
Behaviors Technical Manual References
[150] F. Samejima. Evaluation of the mathematical models for ordered polychotomous responses.
Behvariometrika, 23:17–35, 1996.
[151] F. Samejima. Departure from normal assumptions: A promise for future psychometrics with
substantive mathematical modeling. Psychometrika, 62:471–493, 1997.
[152] Richard J. Shavelson and Noreen M. Webb. Generalizability Theory: A Primer. Sage,
Thousand Oaks, CA, 1991.
[153] Nicole Shectman, Louise Yarnall, Regie Stites, and Britte Cheng. Empowering adults
to thrive at work: Personal success skills for 21st century jobs, a report on promis-
ing research and practice. https://www.sri.com/sites/default/files/publications/
joyceempoweringadultstothriveatwork_4.pdf, 2016. The Joyce Foundation, Chicago, IL.
[154] Lorrie A. Shepard. Evaluating test validity. In L. Darling-Hammon, editor, Review of Research
in Education, volume 19, pages 405–450. AERA, Washington, D.C., 1993.
[155] S.G. Sireci, L Patsula, and R.K. Hambleton. Statistical Methods for Identifying Flaws in
the Test Adaptation Process. In R.K. Hambleton, P. Merenda, and C. Spielberger, editors,
Adapting Educational and Psychological Tests for Cross-cultural Assessment, pages 93–116.
Lawrence Erlbaum Publishers, Mahwah, NJ, 2005.
[156] Stephan G. Sireci and Tia Sukin. Test Validity. In Kurt F. Geisinger, Bruce A. Bracken,
Janet F. Carlson, Jo-Ida C. Hansen, Nathan R. Kuncel, Steven P. Reise, and Michael C.
Rodrigues, editors, APA Handbook of Testing and Assessment in Psychology Volume 1: Test
Theory and Testing and Assessment in Industrial and Organizational Psychology, chapter 4,
pages 61–84. American Psychological Association, 750 First Street NE, Washington, D.C.
20002-4242, 2013.
[157] C. Spearman. The proof and measurement of association between two things. American
Journal of Psychology, 15:72–101, 1904.
[158] C. Spearman. Correlation calculated from faulty data. British Journal of Psychology,
3(3):271–295, 1910.
[159] Jon Starkweather. Step out of the past: Stop using coefficient alpha; there are better ways
to calculate reliability. Research and Statistical Support, University of North Texas, 2012.
[160] M.J. Stevens. Prescott Lecky: Pioneer in consistency theory and cognitive therapy. Journal
of Clinical Psychology, 48(6):807–811, 1992.
[161] Yoshio Takane and Jan de Leeuw. On the Relationship Between Item Response Theory and
Factor Analysis of Discretized Variables. Psychometrika, 52:393–408, 1987.
[162] Wei Tang, Ying Cui, and Oksana Babenko. Internal consistency: Do we really know what it
is and how to assess it? Journal of Psychology and Behavioral Science, 2(2):205–220, June
2014.
[163] Inc. Target Training International. Employee success prediction system, 1996. US Patent
5,551,880.
[164] Inc. Target Training International. Network based document distribution method (ids), 2007.
US Patent 7,249,372.
166
Behaviors Technical Manual References
[165] Inc. Target Training International. Position analysis system and method (benchmark), 2007.
US Patent 7,184,969.
[166] Ltd. Target Training International. 2017 talent insights reliability study. Unpublished research
study, 2017.
[168] L. L. Thurstone. A method of scaling psychological and educational tests. Journal of Edu-
cational Psychology, 16:433–451, 1925.
[169] L. L. Thurstone. Multiple Factor Analysis. University of Chicago Press, Chicago, 1947.
[170] L. R. Tucker. Maximum validity of a test with equivalent items. Psychometrika, 11:1–13,
1952.
[171] S.R. Wallace, W.V. Clarke, and R.J. Dry. The Activity Vector Analysis as a selector of life
insurance salesmen. Personnel Psychology, 49:37–345, 1956.
[172] D. Warburton. The importance of finding the real person at work. Department of Psychology,
University of Reading, Whiteknights, Reading, UK, 1998.
[173] D.M. Warburton and J.I. Suiter. Discovering the person at work. In B. Bonnstetter, J. Suiter,
and R. Widrick, editors, The Universal Language DISC: A Reference Manual. Target Training
International, Inc., Scottsdale, AZ, 1993.
[174] D.M. Warburton and J.I. Suiter. Discovering the person at work. In B. Bonnstetter, J. Suiter,
and R. Widrick, editors, The Universal Language DISC: A Reference Manual. Target Training
International, Inc., Scottsdale, AZ, 1999. Reprint and update of the 1993 version.
[175] D.J. Weiss. Introduction. In D.J. Weiss, editor, New Horizons in Testing: Latent Trait
Test Theory and Computerized Adaptive Testing, chapter 1, pages 1–8. New York: Academic
Press, 1983.
[176] Rand R. Wilcox. Introduction to Robust Estimation and Hypothesis Testing: A Volume in
Statistical Modeling and Decision Science. Academic Press, New York, fourth edition edition,
2017.
[177] Edward W. Wiley, Noreen M. Webb, and Richard J. Shavelson. The Generalizability of
Test Scores. In Kurt F. Geisinger, Bruce A. Bracken, Janet F. Carlson, Jo-Ida C. Hansen,
Nathan R. Kuncel, Steven P. Reise, and Michael C. Rodrigues, editors, APA Handbook of
Testing and Assessment in Psychology Volume 1: Test Theory and Testing and Assessment
in Industrial and Organizational Psychology, chapter 3, pages 43–60. American Psychological
Association, 750 First Street NE, Washington, D.C. 20002-4242, 2013.
[178] Richard E. Zinbarg, William Revelle, Iftah Yovel, and Wen Li. Cronbach’s α, Revelle’s β,
and McDonald’s ωh : Their relations with each other and two alternative conceptualizations
of reliability. Psychometrika, 70(1):123–133, March 2005.
167
Behaviors Technical Manual Adaptation Protocols
Test translation is probably the more common term, but adaptation is the broader term
and refers to moving a test from one language and culture to another. Test adaptation
refers to all of the activities including: deciding whether or not a test in a second
language and culture could measure the same construct in the first language; selecting
translators; choosing a design for evaluating the work of test translators (e.g., forward
and backward translations); ...
1. Pre-conditions;
2. Test Development;
4. Administration;
6. Documentation.
Before going on to the guidelines, we present a short motivation for the importance of having
validated test adaptation protocols. Figure A.1 shows the emotional response of a multi-lingual
168
Behaviors Technical Manual Adaptation Protocols
individual to the word enthusiastic presented to the person in multiple languages. This person’s
native language is Spanish (Mexico). The following explanation of the brain images is taken from
[52].
She showed the maximum positive response (left-side activation) to her native language
(Spanish), a relatively neutral (no overt activation in either hemisphere) to English, a
moderately positive response (some left side activation) to French, and a strong and
complex response (activation on both sides) to German. This was consistent with the
fact that she was more comfortable with languages like her own and found German
difficult and requiring effort to understand.
A.1 Pre-conditions
There are three pre-condition guidelines in [53], the first of which is legal in nature, and the second
two are centered on ensuring what is being assessed is sufficiently understood in the same way
across all cultural and language groups of interest.
PC-1 Obtain the necessary permission from the holder of the intellectual property
rights relating to the test before carrying out any adaptation.
Given that most of the suite of TTI Success Insights assessments are developed by TTI Success
Insights, PC-1 should generally not be a problem. However, to cover all bases, it is necessary to at
least document in some fashion that we do, in fact, have all necessary permissions.
PC-2 Evaluate that the amount of overlap in the definition and content of the con-
struct measured by the test and the item content in the population of interest is
sufficient for the intended use (or uses) of the scores.
According to the explanation provided in [53], this guideline requires that what is assessed should
be understood in the same way across language and cultural groups, and this is the foundation
of valid cross-cultural comparisons. Since the ITC refers to this concept as being foundational,
it is critical to measure and report this in an appropriate manner. The ITC offers suggestions
that experts with respect to the construct measured and who are familiar with the cultural groups
being tested should be recruited to evaluate the legitimacy of the construct measured in each of
the cultural/linguistic groups.
PC-3 Minimize the influence of any cultural and linguistic differences that are irrel-
evant to the intended uses of the test in the populations of interest.
The comments on this guideline from the ITC are centered around appropriate choices of translators.
In summary, it is not enough to have individuals familiar with a language, even completely fluent
in the target language. The translators must not only be native speakers of the target language,
they must also be culturally knowledgeable, preferably native to the culture as well. It may be
shown through the test adaptation process that it is not enough to have a single French translation,
for example. It is necessary to determine whether large enough cultural differences exist between
the various French speaking areas TTI Success Insights is associated with to have a significant
impact on the interpretation or uses of the output of a given assessment. We may find that cultural
differences dictate that the French language version of an assessment developed for France may not
be suitable for French speaking Canadians. Regardless, the protocols should outline and discuss
how to document whether cultural or linguistic differences exist and impact test adaptation.
169
Behaviors Technical Manual Adaptation Protocols
TD-3 Provide evidence that the test instructions and item content have similar mean-
ing for all intended populations.
Note that this is the first guideline to use the term evidence. Our interpretation of that is a need
for data collection and analysis in some form. The ITC suggests the following possibilities:
The ITC also suggests that using small tryouts, conducting interviews of the administrators and
respondents, and cognitive interviewing have shown solid success in test adaptation protocol set-
tings.
TD-4 Provide evidence that the item formats, rating scales, scoring categories, test
conventions, modes of administration, and other procedures are suitable for all in-
tended populations.
The main concern expressed by the ITC in this setting is the possibility of confusion arising from
the administration platform (e.g., an individual is not familiar with computerized testing) or rating
scales (e.g., an individual has never seen a 5-point Likert scale) and similar for the other areas
above. It is suggested that either practice problems are provided or a thorough briefing accompany
the assessment in some form.
TD-5 Collect pilot data on the adapted test to enable item analysis, reliability assess-
ment, and small-scale validity studies so that any necessary revisions to the adapted
test can be made.
170
Behaviors Technical Manual Adaptation Protocols
This guideline should be fairly well understood. We need to conduct validity and reliability studies
on every language into which any of our assessments have been or will be translated. This is
obviously a tall order and will take a good deal of time, especially for any languages which service
smaller populations.
C-1 Select samples with characteristics that are relevant for the intended use of the
test and of sufficient size and relevance for the empirical analyses.
This appears to be a bit of an obvious statement. However, the subtleties underlying the statement,
and their implications, may or may not be as obvious. As noted in the first part of this series, there
are some fairly strict guidelines on how to translate or adapt a test from one language to another.
It would seem that if we require the target language translator to be native to the language and
the culture, then the test samples should also be native to the language and the culture. In other
words, if the target language is Spanish from the Americas, it would not be acceptable to take
samples from Spain or to find a Spanish speaking population in the U.S.
[53] also makes some interesting comments regarding reliability studies. We provide the following
quote from pp. 17:
However, it is often overlooked that reliability is a joint characteristic of the test and
the population (McDonald, 1999) - because it depends on both the true score variance
(population characteristic) and error variance (test characteristic). Therefore, the same
error variance can lead to a higher reliability simply due to the larger true score variance
in the source language group. (McDonald, 1999) shows that the Standard Error of
measurement (which is the square root of error variance) is in fact a more appropriate
quantity to compare between samples, not reliability.
It is noted that the reference to (McDonald, 1999) is not provided in the attached bibliography. At
the time of writing this article, the source [53] unfortunately does not provide the full reference in
its bibliography. The authors of this article are researching to find the exact source.
The Guidelines provide the following rough guidance on minimal sample sizes:
1. Identification of potentially biased test items require a minimum of 200 persons per version
of the test;
2. Item Response Theory (IRT) analyses require a minimum of 500 persons per version of the
test;
3. Factor analytic studies require a minimum of 300 persons per version of the test.
The authors of this article do note that in many studies conducted by the TTI research team have
shown that samples of 1000 still show some signs of instabilities in the convergence of IRT models
in Likert-style response formats. As such, we typically require a minimum of 1200 persons per
version of the test. It is also noted that in some cases this is a very difficult level of participation
to achieve and limited information may be available for smaller sample sizes.
171
Behaviors Technical Manual Adaptation Protocols
C-2 Provide relevant statistical evidence about the construct equivalence, method
equivalence, and item equivalence for all intended populations.
Before discussing the area of item equivalence, note that the approaches for construct and method
equivalence are discussed in part one of this series of articles under guidelines PC-2 and PC-3,
respectively.
Item equivalence is essentially studied under the name differential item functioning (DIF) analysis.
According to [53],
In general, DIF exists if two test-takers, from two different (cultural-linguistic) popula-
tions, have the same level of the measured trait but have different response probability
on a test item.
From the perspective of the TTI research team, one can replace the statement “different (cultural-
linguistic) populations” with “categories of populations” in the sense that in the U.S. we have a
need to study the existence of disparate impact assessment items may have on protected classes
(categories) such as race, gender, age, and so on.
[53] points to two specific areas as potential sources of DIF: translation problems and cultural
differences. As an example of a translation problem, it is necessary to avoid the use of vocabulary
in the target language that is correct in meaning but not commonly used in a subset of the target
language local culture. The following is a partial list of potential problem areas:
1. Change in item difficulty due to sentence length, complexity, or use of easy or difficult vocab-
ulary;
2. Change of meaning due to deletion of sentences, inaccurate translation, more than one mean-
ing in target language;
3. Non-equivalent meaning of words across cultures;
4. Cultural differences causing items to function different across languages.
[53] provides the following guidance on procedures to evaluate DIF.
1. IRT based methods;
2. Mantel-Haenszel (MH) procedure and extensions;
3. Logistic regression (LR) based procedures;
4. Restricted factor analytic (RFA) procedures.
While it was noted earlier that construct equivalence is covered, in part, in PC-2, [53] states the
following additional information about construct equivalence:
Finally, yet importantly, this guideline will require researchers to address construct
equivalence. There are at least four statistical approaches for assessing construct equiv-
alence across source and target language versions of a test: Exploratory factor analysis
(EFA), confirmatory factor analysis (CFA), multidimensional scaling (MDS), and com-
parison of nomological networks (Sireci, Patsula, & Hambleton, 2005).
The reference for (Sireci, Patsula, & Hambleton, 2005) is provided in [155]. It is noted in [53] that
there are no commonly agreed-upon rules for deciding when structures can be considered equivalent.
Hence, as EFA does not accommodate separate factor structures, it is more desirable to use CFA,
weighted MDS (WMDS), or Exploratory Structural Equation Modeling (ESEM).
172
Behaviors Technical Manual Adaptation Protocols
C-3 Provide evidence supporting the norms, reliability, and validity of the adapted
test version of the test in the intended populations.
Directly from [53]:
The norms, validity evidence, and reliability evidence of a test in its source language ver-
sion do not automatically apply to other possible adaptations of the test in to different
cultures and languages.
Further, [53] suggests essentially following the standards set forth in [5] as well as in [156]. These
standards state that one should provide evidence of validity based on the following five areas:
1. Test content;
2. Internal structure;
4. Response processing;
5. Consequences of testing.
[53] also provides a short list of approaches to internal structure validity evidence such as EFA,
CFA, structural equation modeling (SEM), and multitrait-multimethod analyses.
C-4 Use an appropriate equating design and data analysis procedures when linking
score scales from different language versions of a test.
This topic may be more readily understood in the language of parallel form tests. For example,
given a high stakes testing environment, there may be an increased risk of cheating. In such cases,
multiple versions of the test may be produced to lessen the probability of an individual successfully
cheating. However, how does one know if a given score on, say, Form A is equivalent to that same
score on Form B? It may be the case that one form or the other is more or less difficult than its
counterpart(s). As an example, it may be the case that a score of 70% on Form A is equivalent to
a score of 60% on Form B due to an increased level of difficulty on Form B.
The process of determining the equivalence of scores in this example is called equating scores. If
we are looking at different adaptations of a test into other languages and/or cultures, the process is
called linking the scores. In order to establish the linking process, [53] suggests it is best to address
the following three questions:
1. Is there evidence that the same construct is being measured in the source and target language
versions of the test? Does the construct have the same relationship with other external
variables in the new culture?
2. Is there strong evidence that sources of method bias have been eliminated?
A-1 Prepare administration materials and instructions to minimize any culture- and
language-related problems that are caused by administration procedures and response
modes that can affect validity of the inferences drawn from the scores.
The idea here is to implement the administration guidelines in such a manner as to address all
factors that may threaten the validity of test scores in specific cultural and linguistic contexts.
Examples include being aware of any aspects of instructions that may have posed problems in the
original language or in other adapted versions. The key to successfully addressing this guideline is
being prepared to anticipate potential problems and posing potential solutions a priori.
Some areas of concern may be clarity of instructions, response mechanisms, allowable time, moti-
vation to complete, knowledge of the purpose, and how the assessment is to be scored.
A-2 Specify testing conditions that should be followed closely in all populations of
interest.
This guideline is all about establishing standard procedures that all assessment respondents should
be required to adhere to. Some accommodations may be warranted in specific circumstances in
order to ensure that each respondent has the best opportunity to respond to the assessment at
their best, whatever that may mean.
Note that any variations should be noted so that these differences may be accounted for and
analyzed as necessary. Their impact on generalizations and interpretations may or may not be
significant. Some important aspects of the assessment process include ensuring testing instructions
and related procedures are adapted in a standardized way which is suitable to the new language
and culture and ensuring that administrators are trained on the new procedures. It is also highly
important for the administrators to respect the new procedures and not to “hold on” to
the old guard.
SSI-1 Interpret any group score differences with reference to all relevant available
information.
While this guideline is part of the ITC adaptation guidance, this advice is pertinent and relevant
to anyone in the assessment industry. One should not take score differences across groups to be a
one-off comparison. All relevant possibilities for differences in scoring should be taken into account
regardless of whether there are differences in language and culture or not.
This is a particularly relevant point in the U.S. given the overly litigious society in existence here.
In multiple U.S. Supreme Court Decisions, the Court has held that differences in assessment scale
scores alone is not enough to establish a disparate impact claim, although many plaintiffs over the
years have attempted to claim that these differences should be sufficient for such claims. We do
not present specific cases, although the interested reader may contact the authors for a list of such
cases and decisions.
Some prime examples of reasons that score differences may exist, and that need to be ruled out
prior to drawing conclusions, include differential motivation, less effective education systems, lack of
familiarity with testing/assessment procedures, and understanding the purpose of the assessment.
174
Behaviors Technical Manual Adaptation Protocols
SSI-2 Only compare scores across populations when the level of invariance has been
established on the scale on which scores are reported.
This section is at least partially linked to an earlier topic called “linking” or “equating”. This is
meant to establish evidence that construct, method, and item bias are not present in the adapted
version of the assessment. According to [53], pp. 27,
...valid direct score comparisons can only be done when scores show the highest level of
equivalence, called scalar equivalence or full score equivalence, which requires scales in
each group have the same measurement unit and the same origin across groups.
The authors of [53] state that the key point is to not over-interpret assessment scores. For example,
do not make comparative statements about respondent performance across language groups unless
measurement invariance has been established.
Doc-2 Provide documentation for test users that will support good practice in the
use of an adapted test with people in the context of the new population.
This documentation should be written for the test user to be used in a practical setting, consistent
with the ITC Guidelines on Test Use. For example, such a user manual should contain, at a
minimum, the following:
1. Description of the construct(s) measured and summarize the information, include a descrip-
tion of the adaptation process;
2. Summaries of evidence supporting the adaptation, including cultural suitability of content,
instructions, response format, etc.;
3. Definitions of suitability of use of test within various subgroups within the population and
any restrictions on use;
4. Explanations of issues that may need to be considered in relation to good practice and test
administration;
5. Explanation of whether and how inter-population comparisons may be made;
6. Information necessary for scoring and norming, or a description of how users may access
scoring procedures;
7. Guidelines for the interpretation of results, including information on the implications of va-
lidity and reliability of the data on the inferences that may be drawn from the scores.
175
Behaviors Technical Manual Predictive Model Results
Remark B.1. Currently, information gathering questions precede all questionnaires collected via
the TTI Success Insights Internet Delivery System (IDS). The nature of these questions is to gather
such information as name, email, and gender, to name a few. Some of this information is used to
generate unique identifiers during the data storage process. Others are used for report generation,
i.e., selecting the appropriate gendered pronoun for use in the report. At the time of writing of this
manual, TTI Success Insights was in the process of generating an up-to-date process for providing
alternate pronoun options for individuals falling outside the male-female binary gender choice.
A more complete description of this process is scheduled to be incorporated in this manual upon
completed incorporation.
Remark B.2. All demographic variables currently collected in the TTI Success Insights assessment
process are done so on a voluntary basis. Any assessment respondent may refuse to answer any or
all of the demographic questions presented. The lone exception is noted in Remark B.1. Technically
speaking, gender is not collected as part of the demographics collection. It has been historically
collected as part of the information used to generate reports for the individual.
Remark B.3. Given that many of the TTI Success Insights assessment users participate in certain
subsets of hiring processes with their clients, an option is provided in the TTI Success Insights
system to select a “hiring link” when generating a series of assessments for a given client. When
this hiring link is selected, certain demographic information is no longer offered as part of the
collection process. For example, age and ethnicity are not longer offered as part of the demographic
collection. Further, as noted above, no information collected by TTI Success Insights is available
to any individual or entity outside TTI Success Insights.
176
Behaviors Technical Manual Predictive Model Results
A brief description of the database and subsets used in this study are provided in the following
tables. The data analyzed were collected between April 01, 2018 and August 31, 2020. The data
set contains 549,782 total respondents.
177
Behaviors Technical Manual Predictive Model Results
178
Behaviors Technical Manual Predictive Model Results
Note B.4. ∗ There are more classes of Veteran Status than are presented here. A decision was
made to collect the above four cases rather than the extended version. The interested reader may
consult the U.S. Department of Veterans Affairs, https: // www. va. gov , or the relevant sections
of the United States Code (U.S.C.) pertaining to definitions of veteran status.
179
Behaviors Technical Manual Predictive Model Results
180
Behaviors Technical Manual Predictive Model Results
181
Behaviors Technical Manual Predictive Model Results
182
Behaviors Technical Manual Predictive Model Results
183
Behaviors Technical Manual Predictive Model Results
184
Behaviors Technical Manual Predictive Model Results
185
Behaviors Technical Manual Predictive Model Results
186
Behaviors Technical Manual Predictive Model Results
187
Behaviors Technical Manual Predictive Model Results
Table B.54: Entertainers and Performers, Sports & Related Workers, All Other
Area Under Curve, N = 104
188
Behaviors Technical Manual Predictive Model Results
189
Behaviors Technical Manual Predictive Model Results
190
Behaviors Technical Manual Predictive Model Results
Table B.66: Food Preparation & Serving Related Workers, All Other
Area Under Curve, N = 140
191
Behaviors Technical Manual Predictive Model Results
Table B.74: Heating, Air Conditioning, & Refrigeration Mechanics & Installers
Area Under Curve, N = 107
192
Behaviors Technical Manual Predictive Model Results
193
Behaviors Technical Manual Predictive Model Results
194
Behaviors Technical Manual Predictive Model Results
Table B.86: Life, Physical, & Social Science Technicians, All Other
Area Under Curve, N = 146
195
Behaviors Technical Manual Predictive Model Results
196
Behaviors Technical Manual Predictive Model Results
197
Behaviors Technical Manual Predictive Model Results
198
Behaviors Technical Manual Predictive Model Results
199
Behaviors Technical Manual Predictive Model Results
200
Behaviors Technical Manual Predictive Model Results
201
Behaviors Technical Manual Predictive Model Results
202
Behaviors Technical Manual Predictive Model Results
203
Behaviors Technical Manual Predictive Model Results
204
Behaviors Technical Manual Predictive Model Results
205
Behaviors Technical Manual Predictive Model Results
206
Behaviors Technical Manual Predictive Model Results
Note B.37. ∗ Statistical Significance at 0.90 level, ? Follow Up and Follow Through
207
Behaviors Technical Manual Predictive Model Results
208
Behaviors Technical Manual Predictive Model Results
209
Behaviors Technical Manual Predictive Model Results
210
Behaviors Technical Manual Predictive Model Results
Note B.42. ∗ No statistically significant model was found for this sample
Note B.43. ∗ No statistically significant model was found for this sample
211
Behaviors Technical Manual Predictive Model Results
Note B.47. ∗ Frequent Interaction With Others, ? Follow Up and Follow Through
212
Behaviors Technical Manual Predictive Model Results
Note B.48. ∗ No significant model for this data set, ? Follow Up and Follow Through
213
Behaviors Technical Manual Predictive Model Results
Note B.52. ∗ No significant model for this data set, ? Customer Relations, † Analysis of Data.
214
Behaviors Technical Manual Predictive Model Results
Note B.54. ∗ People Oriented, ? Analysis of Data, † Frequent Interaction With Others
215
Behaviors Technical Manual Predictive Model Results
Note B.55. ∗ Frequent Interaction With Others, ? Following Policy, † Model is significant at the
90% confidence level.
Note B.56. ∗ People Oriented, ? Competitiveness, † Urgency, †† Model is significant at the 90%
confidence level.
216
Behaviors Technical Manual Predictive Model Results
217
Behaviors Technical Manual Predictive Model Results
218
Behaviors Technical Manual Predictive Model Results
219
Behaviors Technical Manual Predictive Model Results
Note B.63. ∗ model significant at the 90% confidence level for this data set.
Note B.64. ∗ model significant at the 90% confidence level for this data set.
220
Behaviors Technical Manual Discrimination vs. Difficulty Charts
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
221
Behaviors Technical Manual Discrimination vs. Difficulty Charts
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
222
Behaviors Technical Manual Discrimination vs. Difficulty Charts
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
223
Behaviors Technical Manual Discrimination vs. Difficulty Charts
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
224
Behaviors Technical Manual Discrimination vs. Difficulty Charts
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
225
Behaviors Technical Manual Discrimination vs. Difficulty Charts
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
0.8
Discrimination
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Difficulty
226
Behaviors Technical Manual Inter-item Correlation
Item/Scale D I S C
Item 1 0.28 0.21 0.22 0.19
Item 2 0.26 0.21 0.21 0.18
Item 3 0.26 0.21 0.21 0.19
Item 4 0.26 0.22 0.21 0.19
Item 5 0.27 0.20 0.20 0.18
Item 6 0.27 0.20 0.21 0.19
Item 7 0.26 0.21 0.21 0.19
Item 8 0.27 0.20 0.21 0.18
Item 9 0.26 0.20 0.21 0.19
Item 10 0.27 0.20 0.21 0.19
Item 11 0.26 0.21 0.21 0.19
Item 12 0.27 0.20 0.21 0.18
Item 13 0.27 0.21 0.21 0.18
Item 14 0.27 0.20 0.22 0.19
Item 15 0.26 0.20 0.20 0.18
Item 16 0.26 0.21 0.20 0.19
Item 17 0.26 0.21 0.21 0.18
Item 18 0.26 0.20 0.20 0.19
Item 19 0.27 0.20 0.20 0.18
Item 20 0.27 0.21 0.21 0.19
Item 21 0.26 0.21 0.21 0.19
Item 22 0.26 0.21 0.21 0.18
Item 23 0.26 0.22 0.21 0.19
Item 24 0.27 0.21 0.21 0.19
AVG 0.27 0.21 0.21 0.19
227
Behaviors Technical Manual Inter-item Correlation
Item/Scale D I S C
Item 1 0.27 0.21 0.21 0.19
Item 2 0.25 0.20 0.21 0.18
Item 3 0.26 0.21 0.20 0.19
Item 4 0.26 0.21 0.20 0.19
Item 5 0.27 0.20 0.20 0.18
Item 6 0.26 0.20 0.20 0.18
Item 7 0.26 0.21 0.21 0.18
Item 8 0.26 0.20 0.21 0.18
Item 9 0.26 0.20 0.21 0.19
Item 10 0.26 0.20 0.21 0.19
Item 11 0.26 0.21 0.20 0.19
Item 12 0.26 0.20 0.21 0.18
Item 13 0.26 0.21 0.21 0.18
Item 14 0.27 0.20 0.22 0.18
Item 15 0.26 0.20 0.20 0.18
Item 16 0.25 0.21 0.20 0.19
Item 17 0.26 0.21 0.21 0.18
Item 18 0.26 0.20 0.20 0.18
Item 19 0.27 0.20 0.20 0.18
Item 20 0.27 0.21 0.21 0.19
Item 21 0.26 0.21 0.20 0.19
Item 22 0.26 0.21 0.21 0.18
Item 23 0.26 0.21 0.20 0.18
Item 24 0.26 0.21 0.21 0.19
AVG 0.26 0.21 0.21 0.18
228
Behaviors Technical Manual Inter-item Correlation
Item/Scale D I S C
Item 1 0.26 0.21 0.20 0.18
Item 2 0.24 0.20 0.20 0.17
Item 3 0.25 0.21 0.19 0.18
Item 4 0.25 0.21 0.19 0.18
Item 5 0.26 0.20 0.19 0.17
Item 6 0.25 0.20 0.19 0.18
Item 7 0.25 0.21 0.20 0.18
Item 8 0.25 0.20 0.20 0.17
Item 9 0.25 0.20 0.20 0.18
Item 10 0.25 0.20 0.20 0.18
Item 11 0.25 0.21 0.19 0.18
Item 12 0.25 0.20 0.20 0.18
Item 13 0.25 0.21 0.20 0.18
Item 14 0.26 0.19 0.21 0.18
Item 15 0.25 0.20 0.19 0.17
Item 16 0.24 0.21 0.19 0.18
Item 17 0.25 0.21 0.20 0.17
Item 18 0.25 0.20 0.19 0.18
Item 19 0.26 0.20 0.19 0.17
Item 20 0.26 0.20 0.20 0.18
Item 21 0.25 0.21 0.19 0.18
Item 22 0.25 0.20 0.20 0.17
Item 23 0.25 0.21 0.19 0.18
Item 24 0.25 0.20 0.20 0.18
AVG 0.25 0.20 0.20 0.18
229
Behaviors Technical Manual Item-Total Correlation
Item/Scale D I S C
Item 1 0.24 0.37 0.26 0.27
Item 2 0.62 0.44 0.38 0.56
Item 3 0.57 0.36 0.52 0.35
Item 4 0.52 0.20 0.45 0.24
Item 5 0.38 0.58 0.58 0.56
Item 6 0.48 0.49 0.45 0.43
Item 7 0.51 0.29 0.37 0.42
Item 8 0.44 0.50 0.41 0.54
Item 9 0.56 0.60 0.41 0.31
Item 10 0.47 0.53 0.40 0.25
Item 11 0.58 0.27 0.48 0.28
Item 12 0.46 0.58 0.44 0.45
Item 13 0.48 0.38 0.43 0.46
Item 14 0.33 0.63 0.11 0.39
Item 15 0.52 0.56 0.53 0.48
Item 16 0.63 0.40 0.57 0.28
Item 17 0.58 0.23 0.39 0.51
Item 18 0.57 0.48 0.53 0.40
Item 19 0.33 0.57 0.53 0.50
Item 20 0.32 0.40 0.33 0.38
Item 21 0.57 0.22 0.46 0.31
Item 22 0.55 0.41 0.42 0.49
Item 23 0.53 0.21 0.46 0.41
Item 24 0.47 0.41 0.29 0.30
AVG 0.49 0.42 0.42 0.40
Std.Dev. 0.57 0.50 0.48 0.48
230
Behaviors Technical Manual Item-Total Correlation
Item/Scale D I S C
Item 1 0.24 0.38 0.26 0.27
Item 2 0.61 0.45 0.37 0.58
Item 3 0.57 0.35 0.51 0.35
Item 4 0.52 0.21 0.45 0.25
Item 5 0.36 0.59 0.58 0.55
Item 6 0.49 0.49 0.46 0.42
Item 7 0.53 0.31 0.37 0.42
Item 8 0.43 0.50 0.41 0.53
Item 9 0.55 0.59 0.41 0.32
Item 10 0.47 0.53 0.40 0.25
Item 11 0.58 0.25 0.48 0.27
Item 12 0.44 0.59 0.43 0.46
Item 13 0.46 0.37 0.42 0.44
Item 14 0.32 0.64 0.12 0.39
Item 15 0.51 0.55 0.52 0.48
Item 16 0.61 0.39 0.56 0.27
Item 17 0.57 0.23 0.39 0.52
Item 18 0.57 0.47 0.53 0.40
Item 19 0.34 0.57 0.53 0.49
Item 20 0.30 0.41 0.32 0.36
Item 21 0.57 0.21 0.46 0.30
Item 22 0.55 0.41 0.42 0.49
Item 23 0.53 0.22 0.46 0.41
Item 24 0.47 0.39 0.27 0.28
AVG 0.48 0.42 0.42 0.40
Std.Dev. 0.57 0.50 0.48 0.48
231
Behaviors Technical Manual Item-Total Correlation
Item/Scale D I S C
Item 1 0.24 0.38 0.24 0.27
Item 2 0.61 0.45 0.36 0.57
Item 3 0.56 0.36 0.51 0.34
Item 4 0.51 0.20 0.43 0.24
Item 5 0.37 0.57 0.56 0.54
Item 6 0.48 0.48 0.45 0.41
Item 7 0.50 0.31 0.36 0.42
Item 8 0.40 0.50 0.39 0.52
Item 9 0.55 0.59 0.40 0.31
Item 10 0.46 0.53 0.39 0.23
Item 11 0.57 0.25 0.46 0.27
Item 12 0.43 0.57 0.42 0.45
Item 13 0.46 0.36 0.40 0.44
Item 14 0.30 0.63 0.09 0.39
Item 15 0.50 0.55 0.51 0.46
Item 16 0.60 0.38 0.55 0.25
Item 17 0.55 0.25 0.37 0.51
Item 18 0.55 0.48 0.51 0.38
Item 19 0.34 0.57 0.52 0.48
Item 20 0.30 0.41 0.32 0.38
Item 21 0.56 0.21 0.46 0.27
Item 22 0.53 0.42 0.38 0.49
Item 23 0.52 0.22 0.44 0.39
Item 24 0.46 0.40 0.27 0.28
AVG 0.47 0.42 0.41 0.39
Std.Dev. 0.57 0.50 0.48 0.48
232
Behaviors Technical Manual Internal Consistency Charts
Est./Scale D I S C
α 0.897 0.862 0.864 0.847
S.E. 0.001 0.002 0.002 0.002
Upper CI 0.900 0.865 0.867 0.850
Lower CI 0.893 0.859 0.861 0.843
Est./Scale D I S C
α 0.894 0.861 0.861 0.844
S.E. 0.001 0.001 0.001 0.002
Upper CI 0.897 0.864 0.864 0.847
Lower CI 0.891 0.858 0.858 0.841
Est./Scale D I S C
α 0.889 0.860 0.854 0.839
S.E. 0.001 0.001 0.001 0.002
Upper CI 0.892 0.863 0.857 0.842
Lower CI 0.886 0.857 0.851 0.835
233
Behaviors Technical Manual α-if-item-deleted
G α-if-item-deleted Tables
This section contains the historical α-if-item-deleted charts for the TTI Success Insights Style
Insights assessment over the last decade. The current α-if-item-deleted results are presented in
Section 7.4.
234
Behaviors Technical Manual α-if-item-deleted
235
Behaviors Technical Manual α-if-item-deleted
236
Behaviors Technical Manual α-if-item-deleted
237
Behaviors Technical Manual α-if-item-deleted
238
Behaviors Technical Manual α-if-item-deleted
239
Behaviors Technical Manual Development
Lit Review
Identify
Item Bank
Item
Analysis
Response
Content Distri- Iterate
butions
Launch
Factor
Analysis
Item
Removal
Pilot
Item Anal-
EFA/IRT
ysis/IRT
Forced-
Iterate CFA/IRT Field Test
Choice
Launch Likert
240