Differential psychologists use the correlational structure of data to make inferences about the nature and structure of psychological traits. In cognitive psychology, differential approaches have a long and important history, dating back to the seminal work of Binet, Cattell, Horn, Thurstone, Thorndike, and Spearman (McGrew, 2009). Latent variable analysis allowed theories to be tested more rigorously and quantitatively. A given model hypothesizes a factor structure thought to reflect a set of underlying cognitive dimensions or processes. In turn, model fit is assessed by the degree to which that factor structure recreates an observed correlational structure, adjusted for parsimony. The findings from such models have nontrivial implications. For example, should we consider working memory a unitary system or distinguish between verbal and nonverbal stores (Kane et al., 2004)? Is intelligence a single construct or better described by multiple types of intelligence (Carroll, 1993)? To what extent does “executive functioning” represent a unitary versus diverse set of functions (Miyake et al., 2000)? What explains the relationship between working memory capacity and fluid intelligence (Draheim et al., 2021; Unsworth et al., 2014; Unsworth & Spillers, 2010)?

Because factor structures are used to test theories, it is worth asking how subtle methodological differences can impact those factor structures. In a typical latent variable design, participants complete a task battery designed to measure a set of cognitive abilities. Multiple tasks of each cognitive ability are administered, and performance on these tasks can be set to load onto factors. The resulting factor structure is then used to test competing predictions made by different theories. Here, we specifically examined one subtle methodological difference: task sequencing. We were motivated by the observation that differential cognitive psychologists rarely randomize the sequencing of tasks in a latent variable analysis. However, this approach seems to violate two critical tenets of experimentation: randomization and counterbalancing.

To provide an overview of the methods researchers have used in such studies, we examined the Method sections of a sample of latent variable analyses of cognitive abilities (see Table 1). We categorized task sequencing into three general methods: grouped, interleaved, and random. In a grouped sequence, tasks designed to tap the same cognitive construct (e.g., working memory) are completed consecutively. In an interleaved sequence, the tasks are shuffled such that those measuring the same construct are not completed consecutively. In a random sequence, each participant completes the tasks in a different random order. Of the studies reviewed, only two used a random sequence (Colom et al., 2004; Kyllonen & Christal, 1990). Some researchers state that they present the tasks in a fixed sequence to avoid order-related confounds. For example, Miyake et al. (2000) state, “The order of task administration was fixed for all participants (with the constraint that no two tasks that were supposed to tap the same executive function occurred consecutively) to minimize any error due to participant by order interaction” (p. 66). In other instances, researchers used a grouped sequence but counterbalanced the order in which the tasks were delivered. For example, Engle et al. (1999) gave measures of working memory and short-term memory during the first and second days of a 3-day study and gave measures of fluid intelligence on the third day. On Days 1 and 2, they counterbalanced the sequencing of the working memory and short-term memory tasks across participants. In other cases, consideration was given to the sequencing of tasks, but no differences were observed, and results were reported after collapsing across task orders (Shelton et al., 2009). The main question, then, is whether these choices lead to different factor structures.

Table 1 Task sequences in a sample of latent-variable analyses of cognitive abilities

This issue has been considered in other fields, such as survey development and personality assessment. For example, Goodhue and Loiacono (2002) noted that clustering survey items by construct inflates reliability. Other studies have compared randomized and grouped/clustered item orderings (Buchanan et al., 2018; Loiacono & Wilson, 2020; Wilson & Lankton, 2012; Wilson et al., 2017). In a comprehensive study, Wilson et al. (2021) examined five types of item ordering and clustering in an online survey, with conditions that fixed, interleaved, or randomized the sequence of items in different ways. Wilson et al. found significant differences across orderings in item means, cluster means, reliability, and participant reports of fatigue and frustration. In contrast, Schell and Oswald (2013) assessed 50 Big Five personality items using three item orders (randomized per individual, items grouped by factor, and a fixed order of items interleaved across factors) but did not find any differences in the measurement model or in the internal consistencies of the factors. To our knowledge, no study has systematically evaluated task sequencing in the context of a latent, construct-level analysis of cognitive abilities. That was the central goal of the present study.

If any differences in factor structure should emerge, we hypothesized that a grouped sequence, compared with an interleaved or random sequence, would increase the magnitude of task loadings onto their respective factors (Hypothesis 1) and reduce the magnitude of interfactor correlations (Hypothesis 2). If tasks designed to measure a specific construct are presented consecutively, there will be at least two systematic sources of covariance among them: the latent cognitive ability that causes performance differences on those tasks and their shared temporal variance. In a grouped sequence, any temporal and contextual covariates that might influence one’s performance (e.g., fatigue, stress, time of day) will increase the covariance among those tasks. This would be especially true when measures of different constructs are given on different days. Thus, each factor will be a conflation of “true” covariance and temporal/contextual covariance. In turn, the factors will share less covariance with one another. Quantitatively, this would manifest in two ways: (1) a systematic increase in the magnitude (i.e., absolute value) of factor loadings, and (2) a systematic decrease in the magnitude of interfactor correlations. Here, we tested these hypotheses empirically. Specifically, we administered a battery of 12 cognitive tasks, with three each selected to measure working memory capacity, attention control, long-term memory, and fluid intelligence. These four constructs have been demonstrated to be distinct yet correlated (Unsworth et al., 2014). We randomly assigned participants to complete the tasks in a grouped, interleaved, or random sequence. Finally, we tested whether the conditions yielded differences in average performance, latent factor loadings, and interfactor correlations.

Method

We report all variables, how we determined our sample size, and all exclusions. All data, materials, and analysis scripts are publicly available on the Open Science Framework (https://osf.io/a79hf/).

Participants and procedure

A priori, we targeted a sample size of 600 participants (200 per condition) based on simulations from Kretzschmar and Gignac (2019). We used the end of an academic semester as our stopping rule for data collection and finished just short of our target with 598 participants. After exclusions (see below), the final sample included 587 participants (172 in the grouped condition, 217 in the interleaved condition, and 198 in the random condition; 44% identified as women, 54% as men, 1% as nonbinary or another gender, and one participant did not report gender; mean age = 19.01 years, SD = 1.49, range: 17–37; 12% identified as Asian, 6% as Black or African American, 1% as Native Hawaiian or Pacific Islander, 21% as Hispanic or Latino, 55% as White, and 4% as another race/ethnicity). All participants were undergraduate students at Arizona State University who completed the study in exchange for partial course credit. Participants completed the study in groups of four to eight. Prior to beginning the study, participants read and signed an informed consent document. The experimental protocol was approved by the Institutional Review Board at Arizona State University.

We created an R script that first randomly assigned a subject ID number to a condition. For the random condition, the R script took the 12 task labels and shuffled them. For all conditions, the script printed the task sequence onto checklists (see Fig. 1 for task orders by condition). The research assistant then used this printed checklist to administer the task sequence for each participant in precisely the order listed on the sheet. All sessions took approximately 2 hours to complete.
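To make the assignment procedure concrete, the following is a minimal R sketch of the kind of script described above; the task labels, the per-subject condition draw, and the checklist format are illustrative assumptions, not the authors' actual code.

```r
# Minimal sketch of the kind of assignment script described above.
# Task labels, condition assignment, and checklist format are illustrative.
tasks <- c("OSpan", "SymSpan", "ReadSpan",        # working memory capacity
           "Antisaccade", "PVT", "SART",          # attention control
           "DFR", "PicSource", "CuedRecall",      # long-term memory
           "Raven", "NumberSeries", "LetterSets") # fluid intelligence

make_checklist <- function(subject_id) {
  condition <- sample(c("grouped", "interleaved", "random"), 1)
  task_order <- switch(condition,
    grouped     = tasks,                                           # fixed, grouped by construct
    interleaved = tasks[c(1, 4, 7, 10, 2, 5, 8, 12, 3, 6, 9, 11)], # fixed, interleaved order
    random      = sample(tasks))                                   # fresh shuffle per participant
  data.frame(subject = subject_id, condition = condition,
             position = seq_along(task_order), task = task_order)
}

make_checklist(101)  # e.g., print as a checklist for the research assistant
```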

Fig. 1 Task sequences by condition. The sequence in the grouped and interleaved conditions was fixed for all participants, whereas the sequence in the random condition was different for every participant. “*” represents an example task sequence

Tasks

Working memory capacity

Operation span (Unsworth et al., 2005)

Participants were required to remember and recall lists of letters while solving math problems as a secondary processing task. On each list, a single letter appeared for 1 s, followed by a math problem (e.g., (2 × 5) + 3 = ?). The participant clicked the mouse to indicate that they had solved the problem. Then, they were shown a solution, and they clicked boxes labeled “true” or “false” to indicate whether the solution solved the problem. The process repeated for a list of three to seven items. Each list length was presented twice. At the end of a list, participants were presented with a grid of the possible letters, and their task was to report the letters in correct forward serial order. The dependent variable was the total number of letters reported in the correct serial position (maximum score = 50).

Symmetry span (Unsworth, Redick, et al., 2009b)

Participants were required to remember sequences of spatial locations while making symmetry judgments as a secondary processing task. On each list, a single location within a 4 × 4 black-and-white grid flashed red for 500 ms. Then, participants were presented with a black-and-white pattern, and their task was to determine whether the pattern was symmetrical about its y-axis. When they had made their judgment, they clicked the mouse and then clicked one of two boxes labeled “true” or “false.” This process repeated for two to five items. Each list length was presented twice. At the end of each list, the participants were presented with an empty 4 × 4 grid and asked to click the locations that appeared on the list in forward serial order. The dependent variable was the total number of locations reported in the correct serial position (maximum score = 28).

Reading span (Unsworth, Redick, et al., 2009b)

Participants were required to remember and recall lists of letters while making judgments about the sensibleness of sentences as a secondary processing task. On each list, a single letter appeared for 1 s, followed by a sentence (e.g., Jack went up the hill to fetch a trash of water). The participant clicked the mouse to indicate that they had made the sentence judgment. Then, they were shown boxes labeled “true” and “false” and clicked a box to indicate whether the sentence made sense or not. The process repeated for a list length of three to seven items. Each list length was presented twice. At the end of a list, participants were presented with a grid of the possible letters, and their task was to report the letters in correct forward serial order by clicking a box next to each letter. The dependent variable was the total number of letters reported in the correct serial position (maximum score = 50).

Attention control

Antisaccade (Hutchison, 2007; Kane et al., 2001)

On each trial, a central fixation stimulus (***) appeared for either 1 or 2 s. Then, a flashing white cue (=) appeared on either the right or left side of the screen for 300 ms. A target letter (O or Q) then flashed for 100 ms on the opposite side of the screen, followed by a backward mask (#) that remained until the participant responded. Participants made their responses with the O and Q keys of the keyboard. The next trial started after a 1-s blank intertrial interval. Participants first received 8 trials of slow-paced practice, in which the target appeared for 500 ms, then 16 trials of fast-paced practice with a 100-ms target duration, and finally 72 experimental trials. The dependent variable was proportion correct on the experimental trials.

PVT (Dinges & Powell, 1985; Wilkinson & Houghton, 1982)

On each trial, a millisecond counter appeared at the center of the screen (00.000). Then, after a random interval ranging from 2 to 8 s, the counter began counting up like a stopwatch. The participant’s task was to press the spacebar as quickly as possible once the counter started. There were five practice trials followed by 75 experimental trials. Reaction times on the experimental trials were sorted from fastest to slowest for each participant and binned into quintiles. The dependent variable was the average of each participant’s slowest quintile (Unsworth & Spillers, 2010).

SART (Robertson et al., 1997)

On each trial, a single digit (1–9) appeared at the center of the screen for 300 ms, followed by a 900-ms blank intertrial interval. The participant’s task was to press the spacebar upon seeing any digit except 3. Participants were instructed to withhold their response upon seeing the digit 3. There were 450 trials, 11% of which were “no-go” trials. The dependent variable was the standard deviation of reaction times on “go” trials.

Long-term memory

Delayed free recall

Participants were presented with 10 lists of 10 words each. Each word appeared for 1 s, separated by a 500-ms blank interval. At the end of each list, participants were given math problems to solve for 16 s. Then, they were given 45 s to recall as many words from the previous list as possible. The dependent variable was the average proportion of words recalled per list.

Picture source-recognition

Participants were presented with images and asked to remember their spatial locations. During the study phase, 30 images were presented for 3 s each, separated by a 500-ms blank interval. Images appeared in one of four screen quadrants (top left, top right, bottom left, bottom right). During the test phase, participants were presented with 60 images (the 30 old images and 30 new images). Participants reported whether each image was new or old and, if it was old, the quadrant in which it had been presented. Participants used the number pad on the keyboard to make their responses (1 = old image from bottom left, 3 = old image from bottom right, 7 = old image from top left, 9 = old image from top right, 5 = new image). Due to a programming error, accuracy was only correctly recorded for the new images. Therefore, the dependent variable was accuracy on the new trials (i.e., correct rejections).

Cued recall

Participants were presented with five lists of 10 unrelated cue–target word pairs (e.g., horse–gift). Each pair was presented for 2 s, followed by a 1-s blank interval. During the test phase, participants were presented with each cue word (e.g., horse–???) and asked to recall the target word with which it had been paired during study. Participants were given a maximum of 5 s to recall each target before the next cue was presented. Cues were presented in a different random order than during study. The dependent variable was the average proportion of correctly recalled target words across the five lists.

Fluid intelligence

Raven advanced progressive matrices (Raven & Court, 1962)

On each trial, participants were presented with a 3 × 3 grid of patterns. The bottom-right cell of each grid was missing. Below the grid, eight possible solutions were provided. The participants’ task was to select the solution that best completed the implicit pattern in the grid. Participants completed the 18 odd-numbered problems and were given 10 min to solve as many as possible. The dependent variable was the total number of correctly solved problems.

Number series (Thurstone, 1938)

On each trial, participants were presented with a sequence of numbers. The task was to select, from a set of five options, the number that best continued the implicit pattern in the sequence. Participants were given 4.5 min to solve as many of the 15 items as possible. The dependent variable was the total number of correctly solved problems.

Letter sets (Ekstrom & Harman, 1976)

On each trial, participants were given five sets of four letters. The task was to select the letter set that did not follow a rule present in the other four sets. Participants were given 5 min to solve as many of the 20 problems as possible. The dependent variable was the total number of correctly solved problems.

Data analysis

The data were aggregated in R with the tidyverse (Wickham et al., 2019), papaja (Aust, 2023; Aust & Barth, 2018), and data.table (Dowle & Srinivasan, 2021) packages, plotted using ggplot2 (Wickham, 2016), cowplot (Wilke et al., 2019), and ggrain (Allen et al., 2021), and analyzed with the lavaan (Rosseel, 2012) and rstatix (Kassambara, 2020) packages. The analysis script is available on the Open Science Framework (https://osf.io/qe2kw/). To account for missing data when fitting the latent-variable models, we used full-information maximum-likelihood estimation, which allows all available data to inform the variance–covariance matrix to which the model is fit.

Exclusions

We used an outlier detection threshold of 2.5 standard deviations from each variable’s mean to remove extreme values and ensure multivariate normality. Any value falling outside this range was set to missing for the analysis. Proportions of missing/excluded data for each variable are listed in Table 6.
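A minimal R sketch of this exclusion rule is shown below; the column names are assumptions for illustration, not the authors' actual variable names.

```r
# Sketch of the 2.5-SD rule: values beyond the cutoff are set to missing (column names assumed).
library(dplyr)

trim_outliers <- function(x, cutoff = 2.5) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  x[which(abs(z) > cutoff)] <- NA   # set extreme values to missing; do not drop the participant
  x
}

dat <- dat %>% mutate(across(ospan:letter_sets, trim_outliers))
```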

Results

Zero-order correlations among the measures for the full sample are listed in Table 2 and are listed by condition in Tables 3, 4, and 5. Descriptive statistics for the full sample are listed in Table 6. The distributions of task performance by condition are shown in Fig. 2. In our first set of comparisons, we tested for differences in the zero-order correlations between conditions. Each comparison was conducted by applying Fisher’s r-to-z transformation and then testing for a significant difference between correlations measured in independent samples (Fisher, 1925). Six out of 198 comparisons reached significance at p < .05; however, about 10 (198 × 0.05) differences would be expected by chance. No comparison reached the critical threshold of p < .001 (see Supplemental Materials for tables of p values for each comparison). Therefore, the comparisons of zero-order correlations did not suggest a systematic strengthening or weakening of the correlations, either within or across constructs, based on task sequencing.
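For reference, the difference between two independent correlations can be tested as in the following R sketch; the values passed in the example are illustrative, not results from the study.

```r
# Sketch: test the difference between two independent correlations via Fisher's r-to-z.
compare_correlations <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                           # Fisher's r-to-z transformation
  z2 <- atanh(r2)
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # standard error of the difference
  z  <- (z1 - z2) / se
  2 * pnorm(-abs(z))                        # two-tailed p value
}

# Illustrative values only (not results from the study)
compare_correlations(r1 = .45, n1 = 172, r2 = .30, n2 = 217)
```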

Table 2 Zero-order correlations among measures in full sample
Table 3 Zero-order correlations among measures in grouped condition
Table 4 Zero-order correlations among measures in interleaved condition
Table 5 Zero-order correlations among measures in random condition
Table 6 Descriptive statistics in full sample
Fig. 2 Task performance by condition

Table 7 Factor loadings and interfactor correlations for each condition

Measurement invariance

Our next set of analyses examined whether task sequencing had any impact on the factor structure. First, we specified a confirmatory factor analysis with the operation span, symmetry span, and reading span tasks loading onto a Working Memory factor; antisaccade, psychomotor vigilance, and SART loading onto an Attention Control factor; delayed free recall, picture source-recognition, and cued recall loading onto a Long-Term Memory factor; and Raven, number series, and letter sets loading onto a Fluid Intelligence factor (see Fig. 3 for a visualization of the factor structure). This model fit the data well, χ2(48) = 135.68, CFI = 0.94, TLI = 0.91, RMSEA = 0.055, 90% CI [0.045, 0.067], SRMR = 0.04. Next, we reestimated the model, allowing the factor loadings and interfactor correlations to be estimated separately for each condition. Table 7 lists factor loadings and interfactor correlations by condition. The model fit the data well in the grouped condition, χ2(48) = 63.18, CFI = 0.96, TLI = 0.95, RMSEA = 0.042, 90% CI [0.00, 0.068], SRMR = 0.05, and the interleaved condition, χ2(48) = 64.67, CFI = 0.96, TLI = 0.95, RMSEA = 0.040, 90% CI [0.00, 0.063], SRMR = 0.05, but fit slightly worse in the random condition, χ2(48) = 84.29, CFI = 0.91, TLI = 0.88, RMSEA = 0.06, 90% CI [0.04, 0.08], SRMR = 0.05.
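A sketch of how such a model can be specified in lavaan is shown below; the variable names and the data object are assumptions for illustration and are not taken from the authors' script.

```r
# Sketch of the four-factor CFA in lavaan; variable names and the data object are assumed.
library(lavaan)

cfa_model <- '
  WM  =~ ospan + symspan + readspan
  AC  =~ antisaccade + pvt + sart
  LTM =~ dfr + pic_source + cued_recall
  Gf  =~ raven + number_series + letter_sets
'

# Full-sample model; missing = "ml" requests full-information maximum likelihood
fit_all <- cfa(cfa_model, data = dat, missing = "ml")
fitMeasures(fit_all, c("chisq", "df", "cfi", "tli", "rmsea", "srmr"))

# The same model estimated separately within each sequencing condition
fit_by_condition <- cfa(cfa_model, data = dat, group = "condition", missing = "ml")
summary(fit_by_condition, standardized = TRUE, fit.measures = TRUE)
```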

Fig. 3 Factor loadings and interfactor correlations for (A) the full sample, (B) the grouped condition, (C) the interleaved condition, and (D) the random condition. All parameters are standardized, and all were significant at p < .05

Table 8 Results of model comparisons fixing individual factor loadings

Factor loadings

Next, we added an equality constraint to the factor loadings. Our first test compared the grouped condition with the combination of the interleaved and random conditions. We added a specification to the model that constrained all factor loadings to be equal across those two groups. Doing so produced a significantly worse-fitting model according to the comparison of χ2 fit indices, Δχ2(8) = 18.06, p = .02. However, a Bayes factor comparison heavily favored the simpler model in which the factor loadings were fixed across groups (BF > 100,000). Thus, there was evidence against a difference in factor loadings. Next, we added the same equality constraint on the loadings across all three conditions. This model also fit significantly worse than the freely estimated model based on a χ2 comparison, Δχ2(16) = 35.54, p = .001. However, a Bayes factor comparison heavily favored the simpler model in which all factor loadings were fixed, BF > 100,000. To identify any specific difference(s), we iteratively compared models by fixing one factor loading at a time to be equal, first across the grouped and interleaved/random conditions and then between each pair of conditions. There was only one degree of freedom in each of these model comparisons, which allowed us to test whether a specific factor loading differed across conditions. Because there were 48 comparisons, we adjusted our α level for these tests to 0.001 (0.05/48). At this threshold, only one comparison produced a significantly worse model when the parameter was fixed: the loading for cued recall onto the Long-Term Memory factor was higher in the grouped condition than in the other two conditions (see Table 8). We also compared the loadings via Bayes factors. Here, there was only substantial evidence for a difference in the picture source-recognition loading between the grouped and ungrouped conditions. In almost all other cases, there was substantial evidence (BF01 > 3) against a difference (see Table S2). Thus, overall, we found evidence against Hypothesis 1: that administering the tasks in a fixed sequence that groups them by construct would inflate their factor loadings.
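The equality-constraint comparisons can be implemented along the following lines; this sketch assumes the constrained models were specified with lavaan's group.equal argument, which is one standard way to do so rather than the authors' documented code.

```r
# Sketch of the loading-invariance comparisons, assuming lavaan's group.equal constraints.
# The grouped vs. interleaved/random comparison would use a recoded two-level grouping variable;
# the three-condition comparison is shown here.
fit_free     <- cfa(cfa_model, data = dat, group = "condition", missing = "ml")
fit_loadings <- cfa(cfa_model, data = dat, group = "condition", missing = "ml",
                    group.equal = "loadings")

lavTestLRT(fit_free, fit_loadings)  # chi-square difference test (Δdf = 16 across three groups)
```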

Interfactor correlations

Our next test of measurement invariance examined whether the grouped condition systematically decreased the magnitude of the interfactor correlations. We did not have a hypothesis about a difference between the interleaved and random conditions. Therefore, our first test compared the grouped condition with the interleaved and random conditions combined. We specified an equality constraint on the latent covariances. Doing so did not significantly worsen fit, Δχ2(6) = 11.30, p = .08. A Bayes factor comparison heavily favored the simpler model in which all latent covariances were set to be equal (BF > 100,000). For completeness, we then performed the full comparison across all three conditions, fixing the interfactor correlations to be equal. Again, this did not yield a significant worsening of model fit, Δχ2(12) = 16.73, p = .16, and the Bayes factor comparison heavily favored the simpler model in which all latent correlations were fixed across the three conditions (BF > 100,000). Therefore, we also found evidence against Hypothesis 2: that sequencing the tasks by construct would systematically decrease the interfactor correlations. Although interleaving and randomization did appear to increase some latent correlations (see Table 7), there was not a systematic decrease in latent correlations, as hypothesized. Figure 4 shows scatterplots of interfactor correlations by condition.
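The article does not state how the Bayes factors for these nested model comparisons were computed. One common approach, shown here purely as an assumption, is to approximate the Bayes factor from the BIC difference between the constrained and freely estimated models (BF ≈ exp(ΔBIC/2)).

```r
# Sketch (an assumption, not the authors' documented method): approximate the Bayes factor
# for the latent-covariance constraint from the BIC difference between nested models.
fit_cov_free  <- cfa(cfa_model, data = dat, group = "condition", missing = "ml")
fit_cov_equal <- cfa(cfa_model, data = dat, group = "condition", missing = "ml",
                     group.equal = "lv.covariances")

lavTestLRT(fit_cov_free, fit_cov_equal)  # chi-square difference test

delta_bic <- fitMeasures(fit_cov_free, "bic") - fitMeasures(fit_cov_equal, "bic")
bf_for_simpler_model <- exp(delta_bic / 2)  # BF > 1 favors the constrained (simpler) model
bf_for_simpler_model
```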

As a final test of our hypotheses, we used the data from the interleaved condition to test whether accounting for temporal proximity would improve model fit. As a reminder, the task sequence in the interleaved condition was the same for all participants: operation span, antisaccade, delayed free recall, Raven, symmetry span, psychomotor vigilance, picture source-recognition, letter sets, reading span, SART, cued recall, and number series. First, we estimated the confirmatory factor analysis. Then, we allowed the residual variances of neighboring tasks (e.g., operation span and antisaccade, antisaccade and delayed free recall) to covary. Doing so did not improve model fit, Δχ2(11) = 9.36, p = .59, and a Bayes factor comparison heavily favored the simpler model without residual covariances between adjacent tasks (BF > 100,000). Therefore, we again did not find evidence that measures delivered near each other in time systematically shared variance.
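A sketch of the adjacent-task model is shown below, again using assumed variable names; the 11 residual covariances correspond to the 11 neighboring pairs in the fixed interleaved sequence.

```r
# Sketch of the adjacent-task model for the interleaved condition (variable names assumed).
# Each residual covariance pairs two tasks completed back to back in the fixed sequence.
adjacent_model <- paste(cfa_model, '
  ospan ~~ antisaccade
  antisaccade ~~ dfr
  dfr ~~ raven
  raven ~~ symspan
  symspan ~~ pvt
  pvt ~~ pic_source
  pic_source ~~ letter_sets
  letter_sets ~~ readspan
  readspan ~~ sart
  sart ~~ cued_recall
  cued_recall ~~ number_series
')

interleaved  <- subset(dat, condition == "interleaved")
fit_base     <- cfa(cfa_model,      data = interleaved, missing = "ml")
fit_adjacent <- cfa(adjacent_model, data = interleaved, missing = "ml")
lavTestLRT(fit_base, fit_adjacent)  # Δdf = 11, one per neighboring pair
```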

Fig. 4 Scatterplots of interfactor correlations. Points and lines of best fit are plotted separately for each condition. (Color figure online)

Mean differences

Although we did not have any specific hypotheses regarding mean differences across conditions, we submitted the data to one-way analyses of variance (ANOVAs) with a between-subjects factor for condition. Because we estimated 12 ANOVAs, we adjusted our α level to correct for multiple comparisons (0.05/12 = 0.004). At this threshold, no ANOVAs indicated significant differences in mean performance (see Table 9).
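These comparisons amount to one ANOVA per task with a Bonferroni-adjusted α; a base-R sketch with assumed column names is shown below (the article reports using rstatix for this step).

```r
# Sketch of the per-task ANOVAs with a Bonferroni-adjusted alpha (column names assumed).
dvs <- c("ospan", "symspan", "readspan", "antisaccade", "pvt", "sart",
         "dfr", "pic_source", "cued_recall", "raven", "number_series", "letter_sets")
alpha_adj <- 0.05 / length(dvs)  # = 0.004, the adjusted threshold reported above

anovas <- lapply(dvs, function(dv) {
  summary(aov(reformulate("condition", response = dv), data = dat))
})
names(anovas) <- dvs
```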

Table 9 ANOVAs on dependent variables by condition

Next, we used factor analysis to compare construct-level means. The factor scores were saved for each participant. Factor scores were normally distributed, with |skew| values < 1 and |kurtosis| values < 1.50. The scores were submitted to one-way ANOVAs with a between-subjects factor of condition. There were no significant differences in average factor scores across conditions: Working Memory, F(2, 591) = 0.58, p = .56, η2 = 0.002, BF01 = 333.28; Attention Control, F(2, 591) = 1.95, p = .14, η2 = 0.007, BF01 = 84.55; Long-Term Memory, F(2, 591) = 0.58, p = .56, η2 = 0.002, BF01 = 330.45; Fluid Intelligence, F(2, 584) = 2.75, p = .07, η2 = 0.009, BF01 = 38.11 (see Fig. 5).
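A sketch of this factor-score analysis is shown below; the use of lavPredict(), the psych descriptive functions, and BayesFactor::anovaBF() is an assumption about how such an analysis could be run, not a description of the authors' script.

```r
# Sketch of the factor-score comparison; lavPredict(), psych::skew()/kurtosi(), and
# BayesFactor::anovaBF() are assumptions about how such an analysis could be run.
library(psych)
library(BayesFactor)

scores <- as.data.frame(lavPredict(fit_all))  # one column per latent factor (WM, AC, LTM, Gf)
scores$condition <- factor(dat$condition)     # assumes rows align with the input data

sapply(scores[, c("WM", "AC", "LTM", "Gf")], skew)     # check |skew| < 1
sapply(scores[, c("WM", "AC", "LTM", "Gf")], kurtosi)  # check |kurtosis| < 1.5

summary(aov(WM ~ condition, data = scores))  # frequentist one-way ANOVA (repeat per factor)
anovaBF(WM ~ condition, data = scores)       # Bayesian ANOVA; 1/BF gives evidence for the null
```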

Fig. 5 Raincloud plots of distributions of factor scores by condition (see online article for a color version of this figure)

Discussion

The present study was motivated by the observation that differential cognitive psychologists rarely randomize the sequencing of tasks in a latent variable analysis, which violates the principles of randomization and counterbalancing in experimental psychology. Indeed, we often encounter the critique that our latent-variable designs are confounded by delivering tasks in fixed orders. This issue has received considerable deliberation in other fields, such as survey design (Buchanan et al., 2018; Loiacono & Wilson, 2020; Schell & Oswald, 2013; Wilson & Lankton, 2012; Wilson et al., 2017, 2021). Our goal in the present study was to test whether task sequencing systematically affects the latent factor structure of cognitive abilities. This is a nontrivial issue, as factor specification and correlations among factors are used to test theories regarding the structure of cognition.

We tested two hypotheses for how grouping cognitive tasks would affect the factor structure: compared with interleaved and randomized task sequences, grouped task sequences would (1) inflate factor loadings and (2) constrict interfactor correlations. Both effects were hypothesized to occur because shared temporal variance would be conflated with “true” shared variance, that is, variance due to a common underlying cognitive ability. Overall, the data did not show any systematic effects of task sequencing on the factor structure. Although fixing the factor loadings across conditions significantly worsened model fit according to the χ2 comparisons, a Bayes factor comparison heavily favored the more parsimonious model in which all factor loadings were equal across conditions. This was evidence against Hypothesis 1: that organizing the task sequence by construct would increase factor loadings. Further, there were no differences across the conditions in the magnitude of the interfactor correlations. Therefore, we cannot conclude that task sequencing is a significant moderator of the coherence of putative measures of a cognitive ability within a factor, nor that it moderates the strength of correlations among latent factors.

There are a few limitations worth mentioning. First, we gave the 12 tasks on the same day during a single 2-hr session. Therefore, the temporal grouping was still quite narrow. It is not uncommon for large batteries to be completed across multiple days, in which case the shared temporal context for same-day versus different-day tasks might be much stronger. Therefore, the present results may not generalize to situations in which measures of specific constructs are administered on different days; this could be a future extension of the present study. Second, the scope of cognitive abilities measured was narrow. Future work may need to perform similar assessments on related cognitive abilities such as processing speed, crystallized intelligence, creativity, and problem-solving. Third, we specified our sample size based on what we estimated would be sufficient to estimate latent factors for a single group (N = 200). The power of measurement invariance tests is affected by several factors, including sample size, task/item communality, and factor determination (Meade & Bauer, 2007). Some of our tasks, particularly the picture source-recognition task, had relatively low factor loadings (likely due to the programming error, which only correctly scored “new” items). Therefore, studies with larger samples and more strongly interrelated measures within each construct would have greater power to detect violations of measurement invariance. Finally, the study was conducted entirely in a university sample, albeit a diverse one. Future work may need to test for these invariances more systematically with larger and more diverse (i.e., a blend of university and community) samples, more highly correlated manifest variables, and a larger array of cognitive factors.

Ultimately, the present study leaves open an important question: Which task sequence is best? Grouped, interleaved, or random? Although we did not observe a systematic effect of sequencing on the factor structure, we would recommend an interleaved task sequence. True randomization is difficult and imposes an administrative burden on the researcher. Fixed task sequencing provides the added benefit of exposing all participants to the same experimental conditions, as argued by Miyake et al. (2000). Unlike experimental approaches, which typically seek to minimize between-subject variability outside the specific manipulation and to avoid systematic confounds (such as time), the individual-differences approach seeks to maximize interindividual variability while minimizing the variability with which participants experience the tasks. That is, individual-differences research seeks to reduce any sources of noise in the measurement that are not “true” interindividual variance in the measures (e.g., task order, time of day, light/sound conditions, task strategies). One potential source of noise, the time at which a participant completes a task within a session, can therefore be controlled by fixing the task sequence for all participants. The interleaved design thus represents a reasonable balance between pragmatics and precision. Regardless, the latent variable approach may be resistant to sequencing effects precisely because it models out measurement noise and estimates latent factors via systematic covariance among putative measures of a construct.

Conclusions

Variations in task sequencing for latent variable analyses of cognitive abilities did not systematically affect average performance or the latent factor structure. Although task sequencing did not have a systematic effect here, we recommend a best practice of fixing the task sequence and interleaving measures of the respective constructs to reduce systematic sources of noise and maximize the likelihood of observing true interindividual variation.