
arXiv:2004.08410v1 [cs.LG] 17 Apr 2020

Xiao Li
Department of Educational Psychology
University of Illinois at Urbana-Champaign

Hanchen Xu
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

Jinming Zhang
Department of Educational Psychology
University of Illinois at Urbana-Champaign

Hua-Hua Chang
Department of Educational Studies
Purdue University

April 21, 2020



DEEP REINFORCEMENT LEARNING FOR ADAPTIVE LEARNING SYSTEMS

Abstract

In this paper, we formulate the adaptive learning problem—the problem of how to find an individualized learning plan (called policy) that chooses the most appropriate learning materials based on a learner's latent traits—faced in adaptive learning systems as a Markov decision process (MDP). We assume latent traits to be continuous with an unknown transition model. We apply a model-free deep reinforcement learning algorithm—the deep Q-learning algorithm—that can effectively find the optimal learning policy from data on learners' learning processes without knowing the actual transition model of the learners' continuous latent traits. To efficiently utilize available data, we also develop a transition model estimator that emulates the learner's learning process using neural networks. The transition model estimator can be used in the deep Q-learning algorithm so that the optimal learning policy for a learner can be discovered more efficiently. Numerical simulation studies verify that the proposed algorithm is very efficient in finding a good learning policy; in particular, with the aid of a transition model estimator, it can find the optimal learning policy after training on a small number of learners.

Key words: adaptive learning system, transition model estimator, Markov decision
process, deep reinforcement learning, deep Q-learning, neural networks, model-free

Introduction

In a traditional classroom, a teacher uses the same learning material (e.g., textbook, instruction pace) for all students. However, the selected material may be too hard for some students and too easy for others. Further, some students may take a longer time in learning than others. Such a learning process may not be efficient. These issues can be solved if the teacher can make an individualized learning plan for each student: select an appropriate learning material according to each student's ability and let the student learn at her/his own pace. Considering that a very low teacher-student ratio would be required, such an individualized adaptive learning plan may be too expensive to be applied to all students. As such, adaptive learning systems are developed to provide individualized adaptive learning for all students/learners. In particular, with the fast growth of digital platforms, globally integrated resources, and machine learning algorithms, adaptive learning systems are becoming increasingly more affordable, applicable, and efficient (Zhang and Chang, 2016).
An adaptive learning system—also referred to as a personalized/individualized learning or intelligent tutoring system—aims at providing a learner with an optimal and individualized learning experience or instructional materials so that the learner can reach a certain achievement level in the shortest time or reach as high an achievement level as possible in a fixed period of time. First, a learner's historical data are used to estimate her/his proficiency. Then, according to the level of her/his proficiency, the system selects the most appropriate learning material for the learner. After the learner finishes the learning material, an assessment is given to the learner, and her/his proficiency level is updated and is used by the adaptive learning system to choose the next most appropriate learning material for the learner. This process repeats until the learner achieves a certain proficiency level.
In previous studies, the proficiencies or latent traits were typically characterized as vectors of
binary latent variables (Chen et al., 2018b; Li et al., 2018; Tang et al., 2019). However, it is
important to consider the granularity of the latent traits in a complicated learning and assessment
environment in which a knowledge domain consists of several fine-grained abilities. In some cases,
it would be too simple to model learners’ abilities as mastery or non-mastery. For example, when

an item is designed to measure several latent traits, a learner who is regarded as having mastered all the traits related to the item still cannot be assured to answer the item correctly. A possible reason is that the so-called mastery is not full mastery of a latent trait. By measuring learners' traits on continuous scales, the adaptive learning system can be designed to help learners learn and improve until they reach the target levels of certain abilities, so that the learners can achieve target scores in assessments. Moreover, in practice most assessments are designed to measure learners' latent traits (McGlohen and Chang, 2008). In such scenarios, it is better to use a continuous scale to measure the latent traits, as item response theory (IRT) does. In this paper, we will develop an adaptive learning system that estimates learners' abilities using measurement models in order to provide them with the most appropriate materials for further improvement.
Existing research studies have focused on modeling learners’ learning paths (Chen et al.,
2018a; Wang et al., 2018), accelerating learners’ memory speed (Reddy et al., 2017), providing
model-based sequence recommendation (Chen et al., 2018b; Lan and Baraniuk, 2016; Xu et al.,
2016), tracing learners’ concept knowledge state transitions over time (Lan et al., 2014), and
selecting materials for learners optimally based on model-free algorithms (Li et al., 2018;
Tang et al., 2019). However, explicit models are typically needed to characterize learners' learning progress in these studies. While there exist research studies that aim to find the optimal
learning strategy/plan (called policy in the rest of the paper) which chooses the most appropriate
learning materials for learners using model-free algorithms, they all assume discrete latent traits.
In addition, when the number of learners is too small for the system to learn an optimal policy,
these algorithms are not applicable. This paper studies the important, yet less addressed adaptive
learning problem—the problem of finding the optimal learning policy—based on continuous latent traits, and applies machine learning algorithms to tackle challenges such as having only a small number of learners available in historical data.
In this paper, we formulate the adaptive learning problem as a Markov decision process
(MDP), in which the state is the (continuous) latent traits of a learner and the action is the (discrete) learning material given to the learner. However, the state transition model is unknown in practice,
thus making the MDP unsolvable using conventional model-based algorithms such as the value
iteration algorithm (Sutton and Barto, 2018). To solve the issue, we apply a model-free deep

reinforcement learning (DRL) algorithm, the so-called deep Q-learning algorithm, to search for
the optimal learning policy. Model-free DRL algorithms are a class of machine learning algorithms that solve an MDP by learning an optimal policy, represented by neural networks, directly from a sequence of state transitions when the transition model itself is unknown
(François-Lavet et al., 2018). DRL algorithms have been widely applied in solving a variety of
problems in many different fields such as playing Atari games (Mnih et al., 2015), bidding and
pricing in electricity market (Xu et al., 2019), manipulating robotics (Gu et al., 2017), and
localizing objects (Caicedo and Lazebnik, 2015). We refer interested readers to
François-Lavet et al. (2018) for a detailed review on the theories and applications of DRL.
Therefore, the adaptive learning system is embedded with the well-developed measurement
models and the model-free DRL algorithm so as to be more flexible.
However, a deep Q-learning algorithm typically requires a large amount of state transition
data so as to find an optimal policy, which may be difficult to obtain in practice. To cope with
the challenge of insufficient state transition data, we develop a transition model estimator that
emulates the learner’s learning process using neural networks. The transition model that is fitted
using available historical transition data can be used in the deep Q-learning algorithm to further
improve its performance with no additional cost.
The purpose of this paper is to develop a fully adaptive learning system in which (i) the
learning material given to a learner is based on her/his continuous latent traits that indicate the
levels of certain abilities, and (ii) the learning policy that maps the learner’s latent traits to the
learning materials is found adaptively with minimal assumption on the learners’ learning process.
First, an MDP formulation for the adaptive learning problem by representing latent traits in a
continuum is developed. Second, a model-free DRL algorithm—the deep Q-learning algorithm—is
applied, to the best of our knowledge, for the first time, in solving the adaptive learning problem.
Third, a neural network based transition model estimator is developed, which can greatly improve
the performance of the deep Q-learning algorithm when the number of learners is inadequate.
Last, some interesting simulation studies are conducted to serve as demonstration cases for the
development of adaptive learning systems.
The remainder of this paper is organized as follows. In the Preliminaries section, we briefly

review measurement models and make some assumptions on the adaptive learning problem. In
the Adaptive Learning Problem section, we introduce the conventional adaptive learning systems
and develop an MDP formulation for the adaptive learning problem. Then, we apply the deep
Q-learning algorithm to solve the MDP in the Optimal Learning Policy Discovery Algorithm
section, where a transition model estimator that emulates the learners is also developed. Two
simulation studies are conducted in the Numerical Simulation section and some concluding
remarks are made at the end of the paper.

Preliminaries

In this section, we give a brief introduction to measurement models for continuous latent traits, which are an important component of adaptive learning systems. The representation of
learners’ latent traits and assumptions on them are also presented.

Measurement Models

In an adaptive learning system, a test is given to a learner/student after each learning cycle.
The learner’s responses to the test items are collected by the system and her/his latent traits are
estimated using measurement models, specifically IRT models (Rasch, 1960; Lord et al., 1968).
An appropriate IRT model needs to be chosen based on the test’s features such as the test’s
dimensional structure (Zhang, 2013) and its response categories. To be more specific, when item responses are recorded as binary values indicating correct or incorrect answers, a test that evaluates only one latent trait will use unidimensional item response theory (IRT) models (Rasch, 1960; Birnbaum, 1968; Lord, 1980), whereas tests that involve more than one trait will use multidimensional item response theory (MIRT) models (Reckase, 1972; Mulaik, 1972; Sympson, 1978; Whitely, 1980). When item responses have more than two categories, polytomous IRT models such as the partial credit model (Masters, 1982), the generalized partial credit model (Muraki, 1992), and the graded response model (Samejima, 1969) are used in the unidimensional case, and their extensions can be applied in multidimensional cases.

The basic representation of an IRT model is expressed as

$$P(U = u \mid \boldsymbol{\theta}) = f(\boldsymbol{\theta}, \boldsymbol{\eta}, u), \quad (1)$$

where P denotes probability, U is a random variable representing the score on the test item, u is
the possible value of U , θ is a vector of parameters describing the learner’s latent traits, η is a
vector of parameters indicating the characteristic of the item, and f denotes a function that maps
θ, η, u to a probability in [0, 1]. As pointed out in Ackerman et al. (2003), many educational tests
are inherently multidimensional. Therefore, we will use the MIRT as the intrinsic model to build
up the adaptive learning system. As an illustration, the multidimensional two-parameter logistic
IRT (M2PL) model is given by

$$P(U_{ij} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{a}_j, d_j) = \frac{e^{\boldsymbol{a}_j^\top \boldsymbol{\theta}_i + d_j}}{1 + e^{\boldsymbol{a}_j^\top \boldsymbol{\theta}_i + d_j}}, \quad (2)$$

where $U_{ij}$ is the response given by the $i$th test taker to the $j$th item, $\boldsymbol{\theta}_i = [\theta_{i1}, \theta_{i2}, \cdots, \theta_{iD}]^\top$ is a vector in $\mathbb{R}^D$ describing a set of $D$ latent traits, $\boldsymbol{a}_j$ is a vector of $D$ discrimination parameters for the $j$th item, indicating the relative importance of each trait in correctly answering the item, and the intercept parameter $d_j$ is a scalar for item $j$. An applicable item $j$ takes each element of $\boldsymbol{a}_j$ to be nonnegative. Therefore, as each element of $\boldsymbol{\theta}_i$ increases, the probability of a correct response increases.
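To make equation (2) concrete, the following sketch computes the M2PL probability of a correct response in Python; the function name and the example values of the latent traits, discrimination parameters, and intercept are hypothetical.

```python
import numpy as np

def m2pl_probability(theta, a, d):
    """Probability of a correct response under the M2PL model in equation (2).

    theta : length-D vector of latent traits for one learner.
    a     : length-D vector of discrimination parameters for the item.
    d     : scalar intercept parameter of the item.
    """
    z = np.dot(a, theta) + d           # a^T theta + d
    return 1.0 / (1.0 + np.exp(-z))    # logistic link, equal to e^z / (1 + e^z)

# Hypothetical item and learner with two latent traits.
print(m2pl_probability(theta=np.array([0.4, 0.7]),
                       a=np.array([1.2, 0.8]),
                       d=-0.5))
```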
With an online calibration design, an accurately calibrated item bank can be acquired using
previous learners’ response data for an adaptive learning system without large pretest subject
pools (Makransky and Glas, 2014; Zhang and Chang, 2016). After item parameters are
pre-calibrated, a variety of latent trait estimation methods can be applied to estimate learners’
abilities. Conventional methods such as maximum likelihood estimation (Lord et al., 1968),
weighted likelihood estimation and Bayesian methods (e.g. expected a posteriori estimation
(EAP), maximum a posteriori (MAP)) can accurately estimate latent traits in MIRT models.
Their variations have also been extended to estimating latent traits in multiple dimensions. Many latent trait estimation methods result in a bias as small as on the order of $O(n^{-1})$, where $n$ denotes the test length, while approaches that further reduce the bias as well as the variance of

estimates have also been identified and proposed (Firth, 1993; Tseng and Hsu, 2001; Wang, 2015;
Warm, 1989; Zhang et al., 2011).

Assumptions
Denote by $\boldsymbol{\theta}^{(t)} = [\theta_1^{(t)}, \cdots, \theta_D^{(t)}]^\top$ the learner's latent traits at time step $t$, where $D$ is the number of dimensions. Throughout this paper, we make the following simplifying yet practical assumptions:

A1. No retrogression exists in latent traits. That is, $\theta_d^{(t+1)} \geq \theta_d^{(t)}$, $\forall d \in \{1, \cdots, D\}$.

A2. The number of learning materials is finite.

Adaptive Learning Problem

In this section, we first describe the adaptive learning problem and then formulate this
problem as an MDP.

Problem Statement

Figure 1.
Conventional adaptive learning system. The learner takes a test, a latent traits estimator processes the test responses, and a learning policy provided by a teacher selects the next learning material.

A conventional adaptive learning system is illustrated in Figure 1. Such an adaptive learning


system is typical in traditional classrooms and online courses like Massive Open Online Courses
(MOOCs) (Lan and Baraniuk, 2016). In the adaptive learning system, the learner takes some
learning materials to improve her/his latent traits. After the learner finishes learning the
materials, a test or homework is assigned to the learner. Then, the learner’s latent traits are

estimated. Based on the estimated latent traits, the learning system adaptively determines the
next learning material for the learner, which may be one of many forms including a textbook
chapter, a lecture video, an interactive task, instructor support, or an instruction pace. Such a cyclic learning process continues until the learner's latent traits reach or are close to prespecified levels of proficiency.
The tests in an adaptive learning system can be computerized adaptive testing (CAT). CAT is a test mode that administers tests adapted to test takers' trait levels (Chang, 2015). It provides more accurate trait estimates with a much smaller number of items (Weiss, 1982) by sequentially selecting and administering items tailored to each individual learner. Therefore, a relatively short test can assess learners' latent traits with high accuracy.
Conventionally, the learning policy (or plan) is provided by a teacher as illustrated in Figure
1. As aforementioned, however, it is too expensive for teachers to make an individualized adaptive
learning policy for each learner. In this paper, we use a DRL algorithm to search for an optimally
individualized adaptive learning policy for each learner. The algorithm selects the most
appropriate learning material among all available materials for each learner based on her/his
provisional estimated latent traits that are obtained from her/his learning history and
performances in tests. The adaptive selection of learning materials guarantees that the learner reaches a prespecified proficiency level in the smallest number of learning cycles or reaches as high a proficiency level as possible in a fixed number of learning cycles. That is, instead of resorting to an
experienced teacher for the construction of a learning policy as illustrated in Figure 1, we will
develop a systematic method to enable the adaptive learning system to discover an optimal
learning policy from the data that have been collected, which include historical learning materials,
test responses, and estimated latent traits, etc.

Markov Decision Process Formulation

Primer on Markov Decision Process

Before presenting the formulation for the adaptive learning problem, we first briefly review
MDPs. An MDP is characterized by a 5-tuple $(S, A, P, R, \gamma)$, where $S$ is a set of states, $A$ is a set of actions, $P$ is a Markovian transition model, $R: S \times A \times S \to \mathbb{R}$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor (Sutton and Barto, 2018). A transition sample is defined as $(s, a, r, s')$, where $s, s' \in S$ and $a \in A$, and $r = R(s, a, s')$ is a scalar reward when the state transitions into state $s'$ from state $s$ after taking action $a$.
Let $S^{(t)}$ and $A^{(t)}$ denote the state and action at time step $t$, respectively, and $R^{(t)}$ denote the reward obtained after taking action $A^{(t)}$ at state $S^{(t)}$. Note that $S^{(t)}$, $A^{(t)}$, and $R^{(t)}$ are random variables. When both $S$ and $A$ are finite, the transition model $P$ can be represented by a conditional probability, that is,

$$P^{(t)}(s' \mid s, a) = \mathbb{P}(S^{(t+1)} = s' \mid S^{(t)} = s, A^{(t)} = a). \quad (3)$$

The Markovian property of the transition model is that, for any time step $t$,

$$\mathbb{P}(S^{(t+1)} \mid A^{(t)}, S^{(t)}, \ldots, A^{(0)}, S^{(0)}) = \mathbb{P}(S^{(t+1)} \mid A^{(t)}, S^{(t)}). \quad (4)$$

Essentially, the Markovian property requires that a future state is independent of all past states given the current state. Assume $P$ is time-homogeneous, i.e., for any two time steps $t_1$ and $t_2$,

$$P^{(t_1)}(s' \mid s, a) = P^{(t_2)}(s' \mid s, a). \quad (5)$$

Then, we can drop the superscript t and write the transition model as P(s′ |s, a). Note that when
S is continuous, the transition model can be represented by a conditional probability density
function.
Let $\pi: S \to A$ denote a deterministic policy for the MDP defined above. The action-value function for the MDP under policy $\pi$ is defined as follows:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^{(t)} \,\Big|\, S^{(0)} = s, A^{(0)} = a; \pi\right], \quad (6)$$

where $\mathbb{E}$ denotes the expectation. The action-value function $Q^{\pi}(s, a)$ is the expected cumulative discounted reward when the system starts from state $s$, takes action $a$, and follows policy $\pi$ thereafter. The maximum action-value function over all policies is defined as $Q(s, a) = \max_{\pi} Q^{\pi}(s, a)$. A policy $\pi$ is said to be optimal if $Q^{\pi}(s, a) = Q(s, a)$ for any $s \in S$ and $a \in A$. In particular, the greedy policy with respect to $Q(s, a)$, defined as

$\pi^*(s) = \arg\max_a Q(s, a)$, is an optimal policy (Sutton and Barto, 2018). The MDP is solved if we find $\pi^*$.

Theorem 1. (Bertsekas and Tsitsiklis, 1996) The optimal action-value function $Q(s, a)$ satisfies the Bellman optimality equation:

$$Q(s, a) = \mathbb{E}[R^{(0)}] + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q(s', a'). \quad (7)$$

Furthermore, there is only one $Q$ function that solves the Bellman optimality equation.

The Bellman optimality equation is of central importance to solving the MDP. When both S and
A are finite and P is known, model-based algorithms such as the value iteration algorithm can be applied to solve the MDP (Sutton and Barto, 2018).
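For reference, the sketch below shows what value iteration looks like for a small finite MDP with a known transition model, repeatedly applying the Bellman optimality backup in (7) until the action values converge; the array-based interface is an assumption for illustration, and it is precisely this requirement of a known, finite transition model that fails in the adaptive learning problem.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration for a finite MDP with a known transition model.

    P : array of shape (S, A, S); P[s, a, s'] is the transition probability.
    R : array of shape (S, A, S); R[s, a, s'] is the reward.
    Returns the optimal action-value function Q of shape (S, A).
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                          # V(s') = max_a' Q(s', a')
        Q_new = (P * (R + gamma * V)).sum(axis=2)  # Bellman optimality backup, cf. (7)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```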

Adaptive Learning Problem as MDP

We next formulate the adaptive learning problem as an MDP as follows.


State Space: Define the vector of parameters describing the learner's latent traits as the state, i.e., $s = \boldsymbol{\theta}$, which has $D$ continuous variables, where $D$ represents the dimension of the latent traits. For the simplicity of the algorithm construction in the following, the state space is defined as $S = [0, 1]^D$, where each element of $\boldsymbol{\theta}$ lies in $[0, 1]$; a smaller value indicates a lower ability and a larger value a higher ability. Although a latent trait variable is typically defined on $\mathbb{R}$ in IRT, a closed interval, say $[-5, 5]$, is used as the range of a latent trait variable in practice. Let $h_d$ be the prespecified target proficiency level of the $d$th latent trait, which is the level the learners try to reach, where $d = 1, \ldots, D$. Because there is a bijection between $[-5, h_d]$ and $[0, 1]$, an estimated trait $\theta_d \in [-5, h_d]$ can be directly transformed onto the scale of $[0, 1]$. Thus, without loss of generality, we consider the state space to be $S = [0, 1]^D$.
Action Space: Let the learning materials available in the adaptive learning system be indexed by
1, 2, · · · , L. The action a in the adaptive learning system is the index of a learning material,
which is discrete, and the action space is A = {1, · · · , L}.
Reward Function: Recall that the objective of the adaptive learning system is to minimize the
learning steps it takes before a learner’s latent traits reach the maximum, i.e., for θ to reach 1D ,

where $\mathbf{1}_D$ is an all-ones vector in $\mathbb{R}^D$. As such, the reward function is defined as follows:

$$r = R(s, a, s') = \begin{cases} -1, & \text{if } \|s' - \mathbf{1}_D\|_\infty \geq 10^{-3}, \\ 0, & \text{otherwise}, \end{cases} \quad (8)$$

where $\|\cdot\|_\infty$ denotes the infinity norm. Intuitively, the sum of rewards over one episode (the entire learning process of a learner) is the negative of the total number of steps a learner takes before all of her/his latent traits are very close to 1, which indicates that the learner has reached the target levels of all prespecified abilities.
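As a minimal sketch, the reward in equation (8) can be computed as follows; the function name is illustrative, and the tolerance of $10^{-3}$ follows the definition above.

```python
import numpy as np

def reward(s_next, tol=1e-3):
    """Reward in equation (8): -1 per step until every latent trait is within tol of 1."""
    return 0.0 if np.max(np.abs(s_next - 1.0)) < tol else -1.0

print(reward(np.array([0.4, 0.9])))      # -1.0: target not yet reached
print(reward(np.array([0.9995, 1.0])))   # 0.0: within the 10^-3 tolerance of 1_D
```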
Transition Model: The probability distributions of the latent trait as well as the change of trait
are unknown. As a result, the transition model P(s′ |s, a) is not known a priori.
Based on this MDP formulation, the adaptive learning problem is essentially to find an
optimal learning policy, denoted by π ∗ : S → A, that determines the action (learning material
selection) based on the state (latent traits), such that the expected cumulative discounted reward
is maximized. Note that the larger the expected cumulative discounted reward is, the fewer total learning steps a learner takes to reach the target level(s) of the abilities. Since the
transition model P is unknown, the MDP cannot be solved using model-based algorithms such as
the value iteration algorithm. We will resort to a model-free DRL algorithm to solve it in the next
section.

Optimal Learning Policy Discovery Algorithm

In this section, we solve the adaptive learning problem by using the deep Q-learning
algorithm, which can learn the action-value function directly from historical transition data
without knowing the underlying transition model. To utilize the available transition information
more efficiently, we further develop a transition model estimator and use it to train the deep
Q-learning algorithm.

Action-Value Function As Deep Q-Network

Recall that the optimal learning policy can be readily obtained if we know the action-value
function. When the state is continuous and the action is discrete, which is the case in the

adaptive learning problem, the action-value function Q(s, a) cannot be exactly represented in a
tabular form. In such cases, the action-value function can be approximated by some functions,
such as linear functions (Sutton and Barto, 2018) or artificial neural networks (simply referred to
as neural networks) (Mnih et al., 2015). In the former case, the approximate action-value function
is represented as an inner product of the parameter vector and a feature vector that is
constructed from the state. It is important to point out that the choice of features is critical to the
performance of the approximate action-value function. Meanwhile, neural networks are capable of
extracting useful features from the state directly, and have stronger representation power than
linear functions (Goodfellow et al., 2016).

Figure 2.
An illustrative neural network with one hidden layer (input layer, hidden layer, output layer).

As an example of a neural network, Fig. 2 shows an illustrative neural network that consists of an input layer that has 3 units, a hidden layer that has 4 units, and an output layer with 2 units. Let $x = [x_1, x_2, x_3]^\top$, $h = [h_1, h_2, h_3, h_4]^\top$, and $y = [y_1, y_2]^\top$ denote the vectors that come out of the input layer, the hidden layer, and the output layer, respectively. In the neural network, the output of one layer is the input of the next layer. To be more specific, $h$ can be computed from $x$, and $y$ can be computed from $h$, as follows:

$$h = \phi(W_{hx} x + b_h), \quad (9)$$

$$y = W_{yh} h + b_y, \quad (10)$$

where $W_{hx} \in \mathbb{R}^{4 \times 3}$ and $W_{yh} \in \mathbb{R}^{2 \times 4}$ are two weight matrices, $b_h \in \mathbb{R}^4$ and $b_y \in \mathbb{R}^2$ are two bias vectors, and $\phi(\cdot)$ is the so-called activation function, which is applied to its argument element-wise. A popular choice of the activation function $\phi$ is the rectifier, i.e., $\phi(x) = \max(x, 0)$. Conceptually, we can write the output as $y = \varphi(x)$, where $\varphi(\cdot)$ is parameterized by $W_{hx}$, $W_{yh}$, $b_h$, and $b_y$, which can be collectively denoted as a parameter vector $w$. Given a set of input-output values denoted by $\{(x^{(i)}, y^{(i)}): i = 1, \cdots, M\}$, the optimal value of $w$ can be found by solving the following problem:

$$\min_{w} \sum_{i=1}^{M} \|\varphi(x^{(i)}) - y^{(i)}\|^2, \quad (11)$$

where $\|\cdot\|$ is the $L_2$-norm. Problem (11) can be solved by using the gradient descent algorithm or its variants, in which the gradient of the objective function with respect to $w$ can be computed using the well-known backpropagation technique. Neural networks can also be trained using a variety of other optimization algorithms such as Adam and RMSProp (see Goodfellow et al., 2016). Note that there may be several hidden layers, and the more hidden layers there are, the deeper the neural network is. We refer interested readers to Goodfellow et al. (2016) for more comprehensive details about neural networks.
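As a small illustration, the sketch below evaluates the forward pass of equations (9) and (10) for the network in Figure 2 with randomly initialized parameters; in practice the parameter vector w would be fitted by minimizing (11) with gradient descent or a variant such as Adam, typically via an automatic differentiation library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes matching Figure 2: 3 input units, 4 hidden units, 2 output units.
W_hx, b_h = rng.normal(size=(4, 3)), np.zeros(4)
W_yh, b_y = rng.normal(size=(2, 4)), np.zeros(2)

def relu(z):
    """Rectifier activation phi(z) = max(z, 0), applied element-wise."""
    return np.maximum(z, 0.0)

def forward(x):
    """Forward pass implementing equations (9) and (10)."""
    h = relu(W_hx @ x + b_h)   # hidden layer, equation (9)
    return W_yh @ h + b_y      # output layer, equation (10)

print(forward(np.array([0.1, -0.5, 0.3])))
```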
Recall that in the adaptive learning problem, the state is continuous in [0, 1]D , while the
action is discrete A = {1, · · · , L}. The approximate action-value function, denoted by Q̂(s, a),
can be represented using a neural network as follows. The input layer is the state s, or
equivalently, the latent trait vector θ, which has D units. The output has L units, each of which
corresponds to the action-value for one action. To be more specific, the ℓth unit in the output
layer gives Q̂(s, a = ℓ), i.e., the action-value for state s and action ℓ. The number of hidden layers
and the number of units in each hidden layer can be determined through simulation, which is to
be detailed in the numerical simulation section. Such a neural network is also referred to as a
deep Q-network (DQN) (Mnih et al., 2013). Let w denote the parameter vector of the DQN,
which includes all weights and biases in the DQN. To emphasize that Q̂(s, a) is parameterized by
w, we write Q̂(s, a) as Q̂(s, a; w).
Once we have Q̂(s, a; w), the optimal learning policy becomes readily available, which is
π ∗ (s) = arg maxa Q̂(s, a; w). Then, the “Teacher” block in Figure 1 can be replaced with the
DQN as shown in Figure 3.

Figure 3.
Adaptive learning system with DQN. The deep Q-network replaces the teacher in Figure 1 as the source of the learning policy.
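A sketch of such a deep Q-network and the induced greedy policy, written with PyTorch, is given below; the default sizes (D = 2 traits, L = 3 materials, hidden layers of 64 and 32 units) mirror the simulation studies later in the paper, and the class and function names are illustrative.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Deep Q-network: maps a latent-trait vector to one action value per learning material."""
    def __init__(self, n_traits=2, n_actions=3, hidden=(64, 32)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_traits, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], n_actions),
        )

    def forward(self, s):
        # Returns Q_hat(s, a; w) for every action a; shape (..., n_actions).
        return self.net(s)

def greedy_policy(q_net, s):
    """Greedy learning policy pi(s) = argmax_a Q_hat(s, a; w)."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(s, dtype=torch.float32))
    return int(q_values.argmax())
```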

Learning Policy Discovery with Deep Q-Learning

The parameters of the DQN can be learned from the sequence of latent traits and learning materials using the deep Q-learning algorithm proposed by Mnih et al. (2013). The optimal value of the parameter vector of the DQN, $w$, can be found by minimizing the mean squared error between the approximate action-value function and the true action-value function:

$$\min_{w} \; \mathbb{E}[(\hat{Q}(S, A; w) - Q(S, A))^2]. \quad (12)$$

However, solving (12) is extremely difficult, if not impossible, since both $Q(S, A)$ and the transition model are unknown and thus the expectation of the mean squared error cannot be computed. The deep Q-learning algorithm adopts two measures to cope with these challenges. First, the expectation is replaced with the sample average that can be computed from a set of historical transitions, denoted by $\mathcal{M} = \{(s, a, r, s'): s, s' \in S, a \in A\}$, with $|\mathcal{M}| = M$, where $|\cdot|$ denotes the cardinality of a set. That is, (12) is now replaced by the following problem:

$$\min_{w} \sum_{(s, a, r, s') \in \mathcal{M}} (\hat{Q}(s, a; w) - Q(s, a))^2. \quad (13)$$

At time step $t$, the parameter vector is updated using the gradient descent algorithm as follows:

$$w^{(t+1)} = w^{(t)} - \alpha \sum_{(s, a, r, s') \in \mathcal{M}} (\hat{Q}(s, a; w^{(t)}) - Q(s, a)) \frac{\partial \hat{Q}(s, a; w^{(t)})}{\partial w}, \quad (14)$$

where $\alpha > 0$ is the learning rate and $w^{(t)}$ denotes the value of $w$ at time step $t$. Second, the unknown $Q(s, a)$ is further substituted by $r + \gamma \max_{a'} \hat{Q}(s', a'; w^{(t)})$ based on the Bellman optimality equation in (7). Note that when $\|s' - \mathbf{1}_D\|_\infty < 10^{-3}$, which indicates the learning process has ended, $Q(s', a') = 0$. Therefore, (14) now becomes

$$w^{(t+1)} = w^{(t)} - \alpha \sum_{(s, a, r, s') \in \mathcal{M}} (\hat{Q}(s, a; w^{(t)}) - y) \frac{\partial \hat{Q}(s, a; w^{(t)})}{\partial w}, \quad (15)$$

where

$$y = \begin{cases} r, & \text{if } \|s' - \mathbf{1}_D\|_\infty < 10^{-3}, \\ r + \gamma \max_{a'} \hat{Q}(s', a'; w^{(t)}), & \text{otherwise}. \end{cases} \quad (16)$$

The detailed deep Q-learning algorithm that is used to search for the optimal parameter vector of the DQN is presented in Algorithm 1, where one episode represents the complete learning process of one learner and the number of episodes is the number of learners. In order to obtain a good approximate action-value function, the state-action space needs to be sufficiently explored. To achieve this, the so-called $\epsilon$-greedy exploration is adopted in the deep Q-learning algorithm. Specifically, at time step $t$, a random action $a^{(t)}$ is selected with probability $\epsilon^{(t)}$, and a greedy action $a^{(t)} = \arg\max_a \hat{Q}(s^{(t)}, a; w^{(t)})$ is selected with probability $1 - \epsilon^{(t)}$. In this paper, we adaptively decay $\epsilon^{(t)}$ from an initial value $\bar{\epsilon}$ to a final value $\underline{\epsilon}$ over $\tau_\epsilon$ time steps. In addition, the parameter vector is updated at each time step using a set of transitions $\mathcal{M}$ that is resampled from the historical transitions, denoted by $\mathcal{H}$ with $|\mathcal{H}| = H$, so as to reduce the bias that may be caused by the samples.
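The sketch below illustrates one such update in PyTorch: ε-greedy action selection with a linearly decayed exploration rate, followed by a gradient step on the sampled squared error with targets formed as in (15) and (16); the mini-batch format, the default parameter values (which mirror Simulation Study I), and the function names are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def epsilon_at(step, eps_max=1.0, eps_min=0.1, tau_eps=2000):
    """Linearly decay the exploration rate from eps_max to eps_min over tau_eps steps."""
    return eps_max - (eps_max - eps_min) * min(step / tau_eps, 1.0)

def select_action(q_net, s, step, n_actions=3):
    """Epsilon-greedy action selection over the learning materials (0-indexed here)."""
    if random.random() < epsilon_at(step):
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())

def q_learning_update(q_net, optimizer, batch, gamma=0.9, tol=1e-3):
    """One gradient step on the squared error in (15), with targets y from (16).

    batch is a tuple (s, a, r, s_next) of arrays holding M sampled transitions.
    """
    s, a, r, s_next = (torch.as_tensor(x, dtype=torch.float32) for x in batch)
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        done = ((s_next - 1.0).abs().max(dim=1).values < tol).float()  # close to 1_D: episode ends
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```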

Transition Model Estimator

The deep Q-learning algorithm requires a sufficiently large amount of historical transition data in order to find a good approximation of the action-value function, based on which the learning policy is then derived. However, we may not be able to obtain adequate transitions for several reasons, including the lack of adequate learners and the long time it takes to acquire an individual learner's learning path (transitions). Thus, it is desirable to develop an adaptive learning system that can efficiently discover the optimal learning policy after training on a relatively small number of learners. To this end, we develop a transition model estimator which emulates the learning behavior of learners. Specifically, the transition model estimator takes a state $s$ and an action $a$ as inputs, and outputs the next state $s'$. This can be cast as a supervised learning task (a regression task), which can be solved using neural networks. The input layer of the neural network that represents the transition model is a pair of state and action, and the output layer is the next state. The number of hidden layers can be adjusted through the parameter tuning process (see, e.g., Goodfellow et al., 2016, for more details).
Conceptually, we can write the neural network that represents the transition model as $s' = \psi(s, a)$, the parameter vector of which is denoted by $v$. The optimal value of $v$ can be found by solving the following problem using the backpropagation algorithm:

$$\min_{v} \sum_{(s, a, r, s') \in \mathcal{H}} \|\psi(s, a) - s'\|^2, \quad (17)$$

where $\mathcal{H}$ is the set of historical transition data.
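A sketch of such a transition model estimator and the training objective (17), written with PyTorch, is shown below; the single hidden layer with 32 units matches the architecture used in Simulation Study II, while the one-hot action encoding (with actions re-indexed from 0), the number of epochs, and the learning rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionModel(nn.Module):
    """Emulates a learner: predicts the next latent traits from the current state and action."""
    def __init__(self, n_traits=2, n_actions=3, hidden=32):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(n_traits + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_traits),
        )

    def forward(self, s, a):
        # Actions are assumed 0-indexed here; one-hot encode and concatenate with the state.
        a_onehot = F.one_hot(a.long(), num_classes=self.n_actions).float()
        return self.net(torch.cat([s, a_onehot], dim=-1))

def fit_transition_model(model, s, a, s_next, epochs=500, lr=1e-3):
    """Fit psi(s, a) by minimizing the squared prediction error in (17) over historical transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((model(s, a) - s_next) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```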


The adaptive learning system with the DQN and a transition model estimator is shown in
Fig. 4, where the DQN is trained against the transition model, instead of the actual learners.

Figure 4.
Adaptive learning system with DQN and transition model estimator. The DQN is trained against the transition model estimator, which emulates the learner.

Numerical Simulation

In this section, we show the performance of the adaptive learning system with and without
the transition model estimator, and also investigate the impacts of latent trait estimation errors
through two simulation studies.

Simulation Overview

Consider a group of learners in a two-dimensional assessment and a learning environment


with three sets of learning materials. We model the group of learners as a homogeneous MDP. Let

the random vector $\boldsymbol{\Theta}^{(t)} = [\Theta_1^{(t)}, \Theta_2^{(t)}]^\top$ denote a learner's state $S^{(t)}$ at time step $t$, which represents the latent traits in our study. Consider three sets of learning materials regarding the two-dimensional latent trait levels, that is, $A = \{1, 2, 3\}$. Each set of learning materials contains content related to different latent traits. Denote the change of the latent traits from time step $t$ to $t + 1$ by $\Delta\boldsymbol{\Theta}^{(t)} = [\Delta\Theta_1^{(t)}, \Delta\Theta_2^{(t)}]^\top$. The probability of a change $\Delta\boldsymbol{\theta} = [\Delta\theta_1, \Delta\theta_2]^\top$ when transitioning from state $\boldsymbol{\theta} = [\theta_1, \theta_2]^\top$ to $\boldsymbol{\theta}' = [\theta_1', \theta_2']^\top$ can be represented as

$$P(\boldsymbol{\theta}' \mid \boldsymbol{\theta}, a) = \mathbb{P}(\Delta\boldsymbol{\Theta}^{(t)} = \Delta\boldsymbol{\theta} \mid \boldsymbol{\Theta}^{(t)} = \boldsymbol{\theta}, A^{(t)} = a), \quad (18)$$

where $a$ is the index of the set to which the selected learning material belongs. In the following, we only refer to the set to which the selected learning material belongs, denoted by $a$. Assume $\theta_1, \theta_2 \in [0, 1]$, where the value of 0 indicates extremely low ability on the corresponding dimension and the value of 1 indicates the target ability.
In addition, under Assumption A1 of no retrogression, we have $\Delta\theta_1 \in [0, 1 - \theta_1]$ and $\Delta\theta_2 \in [0, 1 - \theta_2]$. As we model the transition of the latent traits as a continuous-state MDP, the changes $\Delta\theta_1$ and $\Delta\theta_2$ only depend on the current latent traits $\boldsymbol{\theta}$ and the selected learning material $a$. Therefore, we let $\Delta\theta_1$ and $\Delta\theta_2$ follow Beta distributions such that $\Delta\theta_1 \sim \mathrm{Beta}(1, g_1(\boldsymbol{\theta}, a))$, where $a \in \{1, 3\}$, and $\Delta\theta_2 \sim \mathrm{Beta}(1, g_2(\Delta\theta_1, \boldsymbol{\theta}, a))$, where $a \in \{2, 3\}$. Further, $\Delta\theta_2 = 0$ when $a = 1$ and $\Delta\theta_1 = 0$ when $a = 2$, which means the first set of materials only helps improve $\theta_1$ while the second set is only related to $\theta_2$. The parameters $g_1(\boldsymbol{\theta}, a)$ and $g_2(\Delta\theta_1, \boldsymbol{\theta}, a)$ of the Beta distributions are calculated by

$$g_1(\boldsymbol{\theta}, a) = \begin{cases} 3 + 8\theta_1 - 0.2\theta_2, & a = 1, \\ 15 + 15\theta_1 - 0.4\theta_2, & a = 3, \end{cases} \quad (19)$$

and

$$g_2(\Delta\theta_1, \boldsymbol{\theta}, a) = \begin{cases} 10 - \theta_1 + 5\theta_2, & a = 2, \\ 20 - 28\theta_1 e^{-\frac{(\theta_1 - 0.6)^2}{0.3}} + 30\theta_2 - 0.3\Delta\theta_1, & a = 3. \end{cases} \quad (20)$$

An intuitive example is how a learner learns addition "+" and subtraction "−". A learning process usually takes a long time, and thus a monotonically decreasing, zero-concentrated distribution is adopted to simulate the ability increase. In that case, each learning step will most likely lead to a small increase of the ability/abilities. Besides, in the distribution $\mathrm{Beta}(1, b)$, the larger $b$ is, the more the density concentrates near 0, which results in a higher chance of generating a smaller $\Delta\theta$. This implies that the higher the ability the learner has on either dimension, the harder it is for her/him to further improve the corresponding ability. Thus, $g_1(\boldsymbol{\theta}, a)$ and $g_2(\Delta\theta_1, \boldsymbol{\theta}, a)$ have positive coefficients in front of $\theta_1$ and $\theta_2$, respectively. Meanwhile, we assume that a higher ability on one dimension helps to increase the other dimension's ability, which results in a negative coefficient in front of $\theta_2$ in $g_1(\boldsymbol{\theta}, a)$ and a negative coefficient in front of $\theta_1$ in $g_2(\Delta\theta_1, \boldsymbol{\theta}, a)$. In particular, we assume the third learning material contains content related to both abilities, and especially helps learners with an intermediate or high ability level of addition to improve further on subtraction. This assumption is reflected in $g_2(\Delta\theta_1, \boldsymbol{\theta}, a)$ when $a = 3$ in equation (20). In addition, if the learner makes big progress in mastering the ability of addition, there is a higher chance for her/him to improve more on learning subtraction. Thus, the coefficient of $\Delta\theta_1$ in $g_2(\Delta\theta_1, \boldsymbol{\theta}, a)$ is negative, which gives a distribution that is less zero-concentrated as $\Delta\theta_1$ increases. Consequently, $\Delta\theta_2$ is more likely to increase more when $\Delta\theta_1$ is large. Note that the transition model is not required by the adaptive learning system; the simulation serves as an example for validating the model-free deep Q-learning algorithm in discovering the optimal learning policy.
Estimation errors ranging from 1% to 15% are also added to estimated latent traits to
evaluate their impacts on the adaptive learning system. Denote the estimation error vector by
e = [e1 , e2 ]⊤ , where e1 and e2 are generated by the same normal distribution such that
e1 , e2 ∼ N (0, σ 2 ). As a result, 99.7% of e1 , e2 lie in the range of (−3σ, 3σ). In the simulation, the
estimated latent traits are calculated by the sum of the true latent traits and the estimation
errors, which are [θ1 + e1 , θ2 + e2 ]⊤ . For instance, if the standard deviation σ is 0.03, the
observation is [θ1 + e1 , θ2 + e2 ]⊤ , where e1 , e2 ∼ N (0, 0.032 ), and 99.7% of e1 , e2 lie in the range of
(−0.09, 0.09).
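For reference, the sketch below simulates one learning step of such a learner, following equations (19) and (20) and adding normally distributed estimation error; how the Beta draw is mapped into the admissible range [0, 1 − θ_d] (here by scaling) and the clipping to [0, 1] are assumptions, since the text does not spell this out.

```python
import numpy as np

rng = np.random.default_rng(0)

def g1(theta, a):
    """Beta parameter for the gain in theta_1, equation (19); a is 1 or 3."""
    if a == 1:
        return 3 + 8 * theta[0] - 0.2 * theta[1]
    return 15 + 15 * theta[0] - 0.4 * theta[1]

def g2(dtheta1, theta, a):
    """Beta parameter for the gain in theta_2, equation (20); a is 2 or 3."""
    if a == 2:
        return 10 - theta[0] + 5 * theta[1]
    return (20 - 28 * theta[0] * np.exp(-(theta[0] - 0.6) ** 2 / 0.3)
            + 30 * theta[1] - 0.3 * dtheta1)

def step(theta, a, sigma=0.0):
    """Sample the next true latent traits and a noisy estimate for learning material a."""
    d1 = rng.beta(1, g1(theta, a)) * (1 - theta[0]) if a in (1, 3) else 0.0
    d2 = rng.beta(1, g2(d1, theta, a)) * (1 - theta[1]) if a in (2, 3) else 0.0
    theta_next = np.clip(theta + np.array([d1, d2]), 0.0, 1.0)
    estimate = theta_next + rng.normal(0.0, sigma, size=2)  # add estimation error e
    return theta_next, estimate

theta_next, estimate = step(np.array([0.0, 0.0]), a=1, sigma=0.03)
```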
Two simulation cases are studied. In the first case, the DQN is trained against actual learners whose ability changes follow the MDP with the kernel distributions described above. In this case, it is presumed that the optimal learning policy can be trained on a sufficient number of learners. The resulting optimal learning policy is compared with a heuristic learning policy, which selects the next learning material that can improve the not-fully-mastered ability, and a random learning policy, which selects any material randomly from the set of three. The impact of different estimation errors is also assessed. In the second case, the DQN is trained against an estimated transition model that is obtained using a small group of learners. The resulting optimal learning policy is compared with that obtained by training against actual learners.

Simulation Study I

Figure 5.
Smoothed rewards under the deep Q-learning algorithm (smoothed episode reward versus episode).

Assume all learners are beginners on the two latent traits when they start using the adaptive learning system, i.e., $\boldsymbol{\Theta}^{(0)} = [0, 0]^\top$. The DQN has two hidden layers, the first of which has 64 units and the second of which has 32 units. The DQN is trained against 2000 learners that are simulated according to the method discussed earlier, i.e., $E = 2000$. Other parameters are chosen as follows: $\gamma = 0.9$, $\alpha = 6 \times 10^{-4}$, $\bar{\epsilon} = 1.0$, $\underline{\epsilon} = 0.1$, $\tau_\epsilon = 2000$, $M = 256$. The Adam algorithm is adopted for
the training of the DQN.
Figure 5 presents the smoothed reward under the deep Q-learning algorithm across the first
1500 episodes with a smoothing window of 20. It can be seen that the reward converges to −15
after 600 episodes, which indicates the optimal learning policy is found after the DQN is trained
using 600 learners.

Figure 6.
Smoothed rewards under DQN, heuristic, and random learning policies (smoothed episode reward versus episode).

Figure 6 and Table 1 compare the smoothed rewards, with a smoothing window of 20, across 200 new learners (labeled as episodes in Figure 6) for the optimal learning policy found by the deep Q-learning algorithm after being trained over 2000 episodes (referred to as the DQN learning policy), the heuristic learning policy, and the random learning policy. The larger the reward is, the fewer learning steps a learner takes to fully master the two latent traits, or in other words, the better the learning policy is. Clearly, the rewards obtained by the deep Q-learning algorithm have a higher mean and a smaller standard deviation (SD) than those obtained by the heuristic learning policy and the random learning policy. These results show that the learning policy found by the deep Q-learning algorithm is much better than the other two.

Figure 7.
An example of a state transition path in the (θ1, θ2) plane with action sequence 1, 1, 1, 1, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2.

Figure 7 presents an example of a state transition path that shows how the latent traits
change with a sequence of actions taken under the DQN learning policy obtained without
considering estimation error. Take the addition and subtraction test as an example. The first
learning material is repeatedly selected to improve the learner’s ability of addition at the
beginning. Then the third material related to both addition and subtraction is selected. In the
last few steps, the second learning material is chosen to further improve the learner’s ability of
subtraction.

Figure 8.
Comparison of rewards under DQN and heuristic learning policies with various estimation error standard deviations.

Figure 8 compares the rewards under the DQN and the heuristic learning policies when estimation errors with various standard deviations ($\sigma$) exist. It shows that the mean rewards obtained by the DQN learning policy under various estimation errors are consistently higher than those of the heuristic learning policy. That is, the DQN learning policy still outperforms the heuristic learning policy even in the presence of estimation errors, which demonstrates that the deep Q-learning algorithm is reliable and stable in finding the optimal learning policy when estimation errors exist.

Simulation Study II

Next, we show the performance of the adaptive learning system with a transition model estimator, which is represented using a neural network with one hidden layer that has 32 units. The prediction accuracy indices are presented in Table 2. The train and test scores are defined as the coefficient of determination in the training and test sets, respectively, calculated by

$$1 - \frac{\sum_{i=1}^{H} \|s^{(i)} - \hat{s}^{(i)}\|^2}{\sum_{i=1}^{H} \|s^{(i)} - \bar{s}\|^2}, \quad (21)$$

where $s^{(i)}$ is the true state, $\bar{s}$ is the average value of the true states, $\hat{s}^{(i)}$ is the state predicted from the previous state and the action taken, and $H$ is the number of transitions. The best possible

score is 1. The root mean square error (RMSE) is calculated by

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{H} \|s^{(i)} - \hat{s}^{(i)}\|^2}{H}}. \quad (22)$$
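A small sketch computing both indices from arrays of true and predicted states (one row per transition) is given below; the function name and array layout are assumptions.

```python
import numpy as np

def prediction_scores(s_true, s_pred):
    """Coefficient of determination (21) and RMSE (22); rows are transitions."""
    sq_err = np.sum((s_true - s_pred) ** 2, axis=1)               # ||s_i - s_hat_i||^2
    sq_dev = np.sum((s_true - s_true.mean(axis=0)) ** 2, axis=1)  # ||s_i - s_bar||^2
    r2 = 1.0 - sq_err.sum() / sq_dev.sum()
    rmse = np.sqrt(sq_err.mean())
    return r2, rmse
```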

Figure 9.
Comparison of rewards under actual and virtual DQN learning policies for various numbers of actual learners.

A DQN is trained on 2000 episodes against the estimated transition model that is fitted
using a certain number of actual learners; the learning policy corresponding to this DQN is
referred to as the virtual DQN learning policy. For the purpose of comparison, another DQN is
trained on the same number of actual learners; the learning policy corresponding to this DQN is
referred to as the actual DQN learning policy. Essentially, these two learning policies differ in how the same set of actual learners is utilized. The actual learners are simulated according to the method discussed in the "Simulation Overview" section and are used to train the actual DQN
learning policy directly. In contrast, these actual learners are used to first fit a transition model,
which is then used to train the virtual DQN learning policy; this allows the virtual DQN learning
policy to be trained over as many episodes as it needs. Figure 9 compares rewards obtained by
the two DQN learning policies when various numbers of actual learners are utilized. It is shown that, with no more than 200 actual learners, the utilization of the transition model can significantly improve the performance of the learning policy, generating much larger mean rewards compared with the algorithm that does not use the transition model. When the number of learners is large enough, both approaches find the optimal learning policy and yield similar rewards.

Concluding Remarks and Future Directions

In this paper, we developed an MDP formulation for an adaptive learning system by


describing learners’ latent traits as continuous instead of simply classifying learners as mastery or
non-mastery of certain skills. The objective of the system is to improve learners’ abilities to the
prespecified target levels. We developed a deep Q-learning algorithm, which is a model-free DRL
algorithm that can effectively find the optimal learning policy from data on learners’ learning
process without knowing the transition model of the learner’s latent traits. To cope with the
challenge of insufficient state transition data, which may result in a poor performance of the deep
Q-learning algorithm, we developed a transition model estimator that emulates the learner's learning process using neural networks, which can be used to further train the DQN and improve its performance.
The two simulation studies presented in the paper verified that the proposed methodology is
very efficient in finding a good learning policy for adaptive learning systems without any help
from a teacher. The optimal learning policy found by the DQN algorithm outperformed the
heuristic and random methods with much higher rewards, or equivalently, much fewer learning
steps/cycles for learners to reach the target levels of all prespecified abilities. Particularly, with
the aid of a transition model estimator, the adaptive learning system can find a good learning
policy efficiently after training using a few learners.
The directions for extending the adaptive learning research include applying the adaptive
learning system on actual learners to further assess the efficiency of the proposed methodology.
Both the DQN algorithm and the transition model estimator can be adopted and evaluated
through real data analysis on an online learning platform. Second, the adaptive learning system
here consists of a latent trait estimator which uses measurement models to estimate latent traits
and a learning policy. Instead, some studies construct the system by assuming that learning materials influence learners' responses to test items directly, without the latent trait estimator incorporated (Lan et al., 2014; Lan and Baraniuk, 2016). As such, learners' learning process is

modeled and traced directly and model-free algorithms can be proposed to find the optimal
learning policy. Third, because each group of learners is assumed to follow a homogeneous MDP, further research can be conducted to classify learners into groups before they use the adaptive learning system in order to find the optimal learning policy for each group. Fourth, machine
learning algorithms for recommendation systems (e.g., collaborative filtering, matrix
decomposition, etc.) can be incorporated with the DRL algorithm to better recommend not only
optimal but also preferred materials to learners (Li et al., 2010). Fifth, it would be interesting to
further examine how much better the DQN learning policy is than the random and heuristic learning policies under different scenarios and constraints. Sixth, it is also interesting to formulate
the adaptive learning problem as a partially observable Markov decision process (POMDP) and
explore solutions to the problem. Finally, future studies can include the modeled learning paths
(Chen et al., 2018a; Wang et al., 2018) as learners' historical data to search for the optimal learning policy more efficiently.

References

Ackerman, T. A., Gierl, M. J., and Walker, C. M. (2003). Using multidimensional item response
theory to evaluate educational and psychological tests. Educational Measurement: Issues and
Practice, 22(3):37–51.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic programming, volume 5. Athena Scientific, Belmont, MA.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability.
Statistical theories of mental test scores, pages 397–472.

Caicedo, J. C. and Lazebnik, S. (2015). Active object localization with deep reinforcement
learning. In Proceedings of the IEEE International Conference on Computer Vision, pages
2488–2496.

Chang, H.-H. (2015). Psychometrics behind computerized adaptive testing. Psychometrika, 80(1):1–20.

Chen, Y., Culpepper, S. A., Wang, S., and Douglas, J. (2018a). A hidden Markov model for
learning trajectories in cognitive diagnosis with application to spatial rotation skills. Applied
Psychological Measurement, 42(1):5–23.

Chen, Y., Li, X., Liu, J., and Ying, Z. (2018b). Recommendation system for adaptive learning.
Applied Psychological Measurement, 42(1):24–41.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38.

François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., Pineau, J., et al. (2018). An
introduction to deep reinforcement learning. Foundations and Trends in Machine Learning,
11(3-4):219–354.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.

Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep reinforcement learning for robotic
manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on
Robotics and Automation (ICRA), pages 3389–3396. IEEE.

Lan, A. S. and Baraniuk, R. G. (2016). A contextual bandits framework for personalized learning
action selection. In Proceedings of the 9th International Conference on Educational Data
Mining, Raleigh, NC: EDM, pages 424–429.

Lan, A. S., Studer, C., and Baraniuk, R. G. (2014). Time-varying learning and content analytics
via sparse factor analysis. In Proceedings of the 20th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 452–461. ACM.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to
personalized news article recommendation. In Proceedings of the 19th international conference
on World wide web, pages 661–670. ACM.

Li, X., Xu, H., Zhang, J., and Chang, H.-h. (2018). Optimal hierarchical learning path design
with reinforcement learning. arXiv preprint arXiv:1810.05347.

Lord, F. M. (1980). Application of item response theory to practical testing problems.

Lord, F. M., Novick, M. R., and Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Makransky, G. and Glas, C. A. (2014). An automatic online calibration design in adaptive testing. Journal of Applied Testing Technology, 11(1):1–20.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2):149–174.

McGlohen, M. and Chang, H.-H. (2008). Combining computer adaptive testing technology with
cognitively diagnostic assessment. Behavior Research Methods, 40(3):808–821.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A.,
Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through
deep reinforcement learning. Nature, 518(7540):529.

Mulaik, S. (1972). A mathematical investigation of some multidimensional Rasch models for psychological tests. In Annual Meeting of the Psychometric Society, Princeton, NJ.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1):i–30.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen:
Danish Institute for Educational Research.

Reckase, M. D. (1972). Development and application of a multivariate logistic latent trait model.

Reddy, S., Levine, S., and Dragan, A. (2017). Accelerating human learning with deep
reinforcement learning. In NIPS’17 Workshop: Teaching Machines, Robots, and Humans.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika monograph supplement.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

Sympson, J. B. (1978). A model for testing with multidimensional items. In Proceedings of the
1977 computerized adaptive testing conference, number 00014. University of Minnesota,
Department of Psychology, Psychometric Methods.

Tang, X., Chen, Y., Li, X., Liu, J., and Ying, Z. (2019). A reinforcement learning approach to
personalized learning recommendation systems. British Journal of Mathematical and Statistical
Psychology, 72(1):108–135.

Tseng, F.-L. and Hsu, T.-C. (2001). Multidimensional adaptive testing using the weighted
likelihood estimation: a comparison of estimation methods. In Annual meeting of NCME,
Seattle.

Wang, C. (2015). On latent trait estimation in multidimensional compensatory item response models. Psychometrika, 80(2):428–449.

Wang, S., Yang, Y., Culpepper, S. A., and Douglas, J. A. (2018). Tracking skill acquisition with cognitive diagnosis models: A higher-order, hidden Markov model with covariates. Journal of Educational and Behavioral Statistics, 43(1):57–87.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3):427–450.

Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied
Psychological Measurement, 6(4):473–492.

Whitely, S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika,
45(4):479–494.

Xu, H., Sun, H., Nikovski, D., Kitamura, S., Mori, K., and Hashimoto, H. (2019). Deep
reinforcement learning for joint bidding and pricing of load serving entity. IEEE Transactions
on Smart Grid, pages 1–1.

Xu, J., Xing, T., and Van Der Schaar, M. (2016). Personalized course sequence recommendations.
IEEE Transactions on Signal Processing, 64(20):5340–5352.

Zhang, J. (2013). A procedure for dimensionality analyses of response data from various test
designs. Psychometrika, 78(1):37–58.

Zhang, J., Xie, M., Song, X., and Lu, T. (2011). Investigating the impact of uncertainty about
item parameters on ability estimation. Psychometrika, 76(1):97–118.

Zhang, S. and Chang, H.-H. (2016). From smart testing to smart learning: how testing
technology can assist the new generation of education. International Journal of Smart
Technology and Learning, 1(1):67–92.

Algorithm 1: Deep Q-Learning Algorithm for the Adaptive Learning Problem

Input: $\gamma$, $\alpha$, $\bar{\epsilon}$, $\underline{\epsilon}$, $\tau_\epsilon$, $M$, $E$
Output: $w$
Randomly initialize $w$ and set the total time step counter $\tau = 0$
for episode $= 1, \cdots, E$ do
  Receive the initial state $s^{(0)}$
  for $t = 0, 1, \cdots$ do
    Compute $\epsilon^{(t)} = \bar{\epsilon} - (\bar{\epsilon} - \underline{\epsilon}) \times \min(\tau/\tau_\epsilon, 1)$ and increase $\tau$ by 1
    With probability $\epsilon^{(t)}$ select a random action $a^{(t)}$; otherwise select $a^{(t)} = \arg\max_a \hat{Q}(s^{(t)}, a; w^{(t)})$
    Send the learning material determined by $a^{(t)}$ to the learner
    Give the learner a test and collect the test responses
    Receive the new state $s^{(t+1)}$ estimated from the test responses by the latent trait estimator
    Compute the reward $r^{(t)} = -1$ if $\|s^{(t+1)} - \mathbf{1}_D\|_\infty \geq 10^{-3}$, and $r^{(t)} = 0$ otherwise
    Store the transition $(s^{(t)}, a^{(t)}, r^{(t)}, s^{(t+1)})$ into $\mathcal{H}$
    Sample $M$ transitions from $\mathcal{H}$ and store them into $\mathcal{M}$
    Update $w$ according to (15), with $y$ computed by (16)
    if $\|s^{(t+1)} - \mathbf{1}_D\|_\infty < 10^{-3}$, break
  end
end

Table 1.
Mean and Standard Deviation (SD) of Rewards under DQN, Heuristic, and Random Learning Policies.

Methods        DQN       Heuristic   Random
Reward mean    -13.49    -21.55      -24.85
Reward SD        4.59      4.76        5.59

Table 2.
Accuracy of Transition Model Trained against Various Numbers of Learners.

No. of learners 10 20 30 40 50 100 150 200 2000

Train Score 0.96 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97
Test Score 0.95 0.97 0.96 0.96 0.97 0.97 0.97 0.97 0.97
RMSE 0.11 0.08 0.09 0.09 0.08 0.08 0.08 0.08 0.08
