SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A
PUBLISHED TITLES

UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS
David Skillicorn

COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda

CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff

KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
David Skillicorn

MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang

NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar

DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada

THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar

GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION
Harvey J. Miller and Jiawei Han

TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami

BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi

INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis

TEMPORAL DATA MINING
Theophano Mitsa

RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu

KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama

STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez

INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu

HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
Handbook of
Educational Data Mining
Edited by
Cristóbal Romero, Sebastian Ventura,
Mykola Pechenizkiy, and Ryan S.J.d. Baker
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the
accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products
does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular
use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2011 by Taylor and Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my wife, Ana, and my son, Cristóbal
Cristóbal Romero
Sebastián Ventura
Mykola Pechenizkiy
Ryan S. J. d. Baker
Contents
Preface...............................................................................................................................................xi
Editors..............................................................................................................................................xv
Contributors................................................................................................................................. xvii
1. Introduction..............................................................................................................................1
Cristóbal Romero, Sebastián Ventura, Mykola Pechenizkiy, and Ryan S. J. d. Baker
11. Novel Derivation and Application of Skill Matrices: The q-Matrix Method.........159
Tiffany Barnes
29. Using Fine-Grained Skill Models to Fit Student Performance with Bayesian
Networks............................................................................................................................... 417
Zachary A. Pardos, Neil T. Heffernan, Brigham S. Anderson, and Cristina L. Heffernan
Preface
are strong in statistical and computational techniques, but techniques and data are not sufficient to advance a scientific domain; researchers with a basic understanding of the teaching and learning process are also required. Thus, education researchers and psychologists are key participants in the EDM community.

teacher is a better choice. The first example is at the edge of what EDM is capable of; the second is, for now, beyond our capabilities.
This job of expanding our horizons and determining what new, exciting questions to ask of the data is necessary for EDM to grow.
The third avenue of EDM is finding which educational stakeholders could benefit from the richer reporting made possible with EDM. Obvious interested parties are students and teachers. However, what about the students' parents? Would it make sense for them to receive reports? Aside from report cards and parent–teacher conferences, there is little communication to parents about their child's performance. Most parents are too busy for a detailed report of their child's school day, but what about some distilled information? A system that informed parents if their child did not complete the homework that was due that day could be beneficial. Similarly, if a student's performance noticeably declines, such a change would be detectable using EDM and the parents could be informed. Other stakeholders include school principals, who could be informed of teachers who were struggling relative to peers, and of areas in which the school was performing poorly. Finally, there are the students themselves. Although students currently receive an array of grades on homework, quizzes, and exams, they receive much less larger-grained information, such as using the student's past performance to suggest which classes to take, or noting that the student's homework scores are lower than expected based on exam performance. Note that such features also change the context of educational data from something that is used in the classroom to something that is potentially used in a completely different place.

Research in this area focuses on expanding the list of stakeholders for whom we can provide information, and where this information is received. Although there is much potential work in this area that is not technically demanding (notifying parents of missed homework assignments is simple enough), such work has to integrate with a school's IT infrastructure, and it changes the ground rules. Previously, teachers and students controlled the flow of information to parents; now parents are getting information directly. Overcoming such issues is challenging. Therefore, this area has seen some attention, but is relatively unexplored by EDM researchers.
The field of EDM has grown substantially in the past five years, with the first workshop referred to as "Educational data mining" occurring in 2005. Since then, it has held its third international conference in 2010, had one book published, has its own online journal, and is now having this book published. This growth is exciting for multiple reasons. First, education is a fundamentally important topic, rivaled only by medicine and health, which cuts across countries and cultures. Being able to better answer age-old questions in education, as well as finding ways to answer questions that have not yet been asked, is an activity that will have a broad impact on humanity. Second, doing effective educational research is no longer about having a large team of graduate assistants to score and code data, and sufficient offices with filing cabinets to store the results. There are public repositories of educational data sets for others to try their hand at EDM, and anyone with a computer and Internet connection can join the community. Thus, a much larger and broader population can participate in helping improve the state of education.

This book is a good first step for anyone wishing to join the EDM community, or for active researchers wishing to keep abreast of the field. The chapters are written by key EDM researchers, and cover many of the field's essential topics. Thus, the reader gets a broad treatment of the field by those on the front lines.
Joseph E. Beck
Worcester Polytechnic Institute, Massachusetts
Editors

Contributors
Cristina L. Heffernan
Department of Computer Science
Worcester Polytechnic Institute
Worcester, Massachusetts

Kenneth R. Koedinger
Human–Computer Interaction Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania

Neil T. Heffernan
Department of Computer Science
Worcester Polytechnic Institute
Worcester, Massachusetts

Irena Koprinska
School of Information Technologies
University of Sydney
Sydney, New South Wales, Australia

Cecily Heiner
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania
and
Computer Science Department
University of Utah
Salt Lake City, Utah

Brett Leber
Human–Computer Interaction Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania

Arnon Hershkovitz
Knowledge Technology Lab
School of Education
Tel Aviv University
Tel Aviv, Israel

Tara M. Madhyastha
Department of Psychology
University of Washington
Seattle, Washington

Earl Hunt
Department of Psychology
University of Washington
Seattle, Washington

David Masip
Department of Computer Science, Multimedia and Telecommunications
Universitat Oberta de Catalunya
Barcelona, Spain

Octavio Juarez
Robotics Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania

Manolis Mavrikis
London Knowledge Lab
The University of London
London, United Kingdom

Brian W. Junker
Department of Statistics
Carnegie Mellon University
Pittsburgh, Pennsylvania

Riccardo Mazza
Faculty of Communication Sciences
University of Lugano
Lugano, Switzerland
and
Department of Innovative Technologies
University of Applied Sciences of Southern Switzerland
Manno, Switzerland

Judy Kay
School of Information Technologies
University of Sydney
Sydney, New South Wales, Australia
Victor H. Menendez
Facultad de Matemáticas
Universidad Autónoma de Yucatán
Merida, Mexico

John C. Nesbit
Faculty of Education
Simon Fraser University
Burnaby, British Columbia, Canada

Agathe Merceron
Media and Computer Science Department
Beuth University of Applied Sciences
Berlin, Germany

Engelbert Mephu Nguifo
Department of Computer Sciences
Université Blaise-Pascal Clermont 2
Clermont-Ferrand, France
Amelia Zafra
Department of Computer Science and
Numerical Analysis
University of Cordoba
Cordoba, Spain
1
Introduction
Contents
1.1 Background.............................................................................................................................. 1
1.2 Educational Applications....................................................................................................... 3
1.3 Objectives, Content, and How to Read This Book............................................................. 4
References..........................................................................................................................................5
1.1 Background
In recent years, researchers from a variety of disciplines (including computer science, statistics, data mining, and education) have begun to investigate how data mining can improve education and facilitate education research. Educational data mining (EDM) is increasingly recognized as an emerging discipline [10]. EDM focuses on the development of methods for exploring the unique types of data that come from an educational context. These data come from several sources, including traditional face-to-face classroom environments, educational software, online courseware, and summative/high-stakes tests. These sources increasingly provide vast amounts of data, which can be analyzed to address questions that were not previously feasible to study, involving differences between student populations or involving uncommon student behaviors. EDM is contributing to education and education research in a multitude of ways, as can be seen from the diversity of educational problems considered in the following chapters of this volume. EDM's contributions have influenced thinking on pedagogy and learning, and have promoted the improvement of educational software, improving software's capacity to individualize students' learning experiences. As EDM matures as a research area, it has produced a conference series (the International Conference on Educational Data Mining, as of 2010 in its third iteration), a journal (the Journal of Educational Data Mining), and a number of highly cited papers (see [2] for a review of some of the most highly cited EDM papers).

These contributions in education build on data mining's past impacts in other domains such as commerce and biology [11]. In some ways, the advent of EDM can be considered as education "catching up" to other areas, where improving methods for exploiting data have promoted transformative impacts in practice [4,7,12]. Although the discovery methods used across domains are similar (e.g., [3]), there are some important differences between them. For instance, in comparing the use of data mining within e-commerce and EDM, there are the following differences:
FIGURE 1.1
Applying data mining to the design of educational systems. (The figure shows a cycle in which users (students, learners, instructors, teachers, course administrators, academic researchers, and school district officials) use, interact with, participate in, design, plan, build, and maintain educational systems (traditional classrooms, e-learning systems, LMSs, web-based adaptive systems, intelligent tutoring systems, questionnaires and quizzes); the systems provide and store course information, contents, academic data, grades, and student usage and interaction data; data mining techniques (statistics, visualization, clustering, classification, association rule mining, sequence mining, text mining) are applied to these data to model learners and learning, communicate findings, and make recommendations.)
Chapters 2 through 4, 9, 12, 22, 24, and 28 discuss methods and case studies for this
category of application.
• Maintaining and improving courses. The objective is to help course administrators and educators determine how to improve courses (contents, activities, links, etc.), using information (in particular) about student usage and learning. The most frequently used techniques for this type of goal are association, clustering, and classification. Chapters 7, 17, 26, and 34 discuss methods and case studies for this category of application.
• Generating recommendations. The objective is to recommend to students which content (or tasks or links) is most appropriate for them at the current time. The most frequently used techniques for this type of goal are association, sequencing, classification, and clustering. Chapters 6, 8, 12, 18, 19, and 32 discuss methods and case studies for this category of application.
• Predicting student grades and learning outcomes. The objective is to predict a student's final grades or other types of learning outcomes (such as retention in a degree program or future ability to learn), based on data from course activities; a minimal illustrative sketch of this task is given after this list. The most frequently used techniques for this type of goal are classification, clustering, and association. Chapters 5 and 13 discuss methods and case studies for this category of application.
• Student modeling. User modeling in the educational domain has a number of applications, including, for example, the detection (often in real time) of student states and characteristics such as satisfaction, motivation, learning progress, affect, learning styles, and preferences, or of certain types of problems that negatively impact learning outcomes (making too many errors, misusing or underusing help, gaming the system, inefficiently exploring learning resources, etc.). The common objective here is to create a student model from usage information. The frequently used techniques for this type of goal are not only clustering, classification, and association analysis, but also statistical analyses, Bayesian networks (including Bayesian Knowledge Tracing), psychometric models, and reinforcement learning. Chapters 6, 12, 14 through 16, 20, 21, 23, 25, 27, 31, 33, and 35 discuss methods and case studies for this category of application.
• Domain structure analysis. The objective is to determine domain structure, using the ability to predict student performance as a measure of the quality of a domain structure model. Performance on tests or within a learning environment is utilized for this goal. The most frequently used techniques for this type of goal are association rules, clustering methods, and space-searching algorithms. Chapters 10, 11, 29, and 30 discuss methods and case studies for this category of application.
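As a minimal illustration of the prediction task mentioned above, the following Python sketch trains a simple classifier on course-activity features. It is only a sketch: the file name, feature names, and pass/fail label are hypothetical, and the decision tree from scikit-learn is just one of the classification techniques listed above, not a method prescribed by this handbook.

```python
# Minimal sketch of predicting a course outcome from activity data.
# The CSV file and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("course_activity.csv")  # hypothetical per-student table

# Hypothetical activity features and a binary outcome (1 = passed the course)
X = data[["logins", "forum_posts", "quiz_avg", "assignments_done"]]
y = data["passed"]

# Hold out some students to estimate how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

print("Accuracy on held-out students:",
      accuracy_score(y_test, model.predict(X_test)))
```

A decision tree is used here only because its rules are easy to inspect; any of the classification or clustering techniques mentioned in the list could be substituted.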
courses and primary and secondary schools. For instance, 6% of U.S. high schools now use
Cognitive Tutor software for mathematics learning (cf. [6]). As these environments become
more widespread, ever-larger collections of data have been obtained by educational data
repositories. A case study on one of the largest of these repositories is given in the chapter
on the PSLC DataShop by Koedinger and colleagues.
This expansion of data has led to increasing interest among education researchers in a variety of disciplines, and among practitioners and educational administrators, in tools and techniques for analyzing the accumulated data to improve understanding of learners and the learning process, and to drive the development of more effective educational software and better educational decision-making. This interest has become a driving force for EDM. We believe that this book can support researchers and practitioners in integrating EDM into their research and practice, and in bringing the educational and data mining communities together, so that education experts understand what types of questions EDM can address, and data miners understand what types of questions are of importance to educational design and educational decision-making.
This volume, the Handbook of Educational Data Mining, consists of two parts. In the first
part, we offer nine surveys and tutorials about the principal data mining techniques that
have been applied in education. In the second part, we give a set of 25 case studies, offering
readers a rich overview of the problems that EDM has produced leverage for.
The book is structured so that it can be read in its entirety, first introducing concepts and methods, and then showing their applications. However, readers can also focus on areas of specific interest, as outlined in the categorization of educational applications above. We welcome readers to the field of EDM and hope that this book is of value to their research or practical goals. If you enjoy this book, we hope that you will join us at a future iteration of the Educational Data Mining conference; see www.educationaldatamining.org for the latest information, and to subscribe to our community mailing list, edm-announce.
References
1. Arruabarrena, R., Pérez, T. A., López-Cuadrado, J., and Vadillo, J. G. J. (2002). On evaluating
adaptive systems for education. In International Conference on Adaptive Hypermedia and Adaptive
Web-Based Systems, Málaga, Spain, pp. 363–367.
2. Baker, R.S.J.d. and Yacef, K. (2009). The state of educational data mining in 2009: A review and
future visions. Journal of Educational Data Mining, 1(1), 3–17.
3. Hanna, M. (2004). Data mining in the e-learning domain. Computers and Education Journal, 42(3),
267–287.
4. Hirschman, L., Park, J.C., Tsujii, J., Wong, W., and Wu, C.H. (2002). Accomplishments and chal-
lenges in literature data mining for biology. Bioinformatics, 18(12), 1553–1561.
5. Ingram, A. (1999). Using web server logs in evaluating instructional web sites. Journal of
Educational Technology Systems, 28(2), 137–157.
6. Koedinger, K. and Corbett, A. (2006). Cognitive tutors: Technology bringing learning science to
the classroom. In K. Sawyer (Ed.), The Cambridge Handbook of the Learning Sciences. Cambridge,
U.K.: Cambridge University Press, pp. 61–78.
7. Lewis, M. (2004). Moneyball: The Art of Winning an Unfair Game. New York: Norton.
8. Li, J. and Zaïane, O. (2004). Combining usage, content, and structure data to improve web
site recommendation. In International Conference on Ecommerce and Web Technologies, Zaragoza,
Spain, pp. 305–315.
9. Pahl, C. and Donnellan, C. (2003). Data mining technology for the evaluation of web-based
teaching and learning systems. In Proceedings of the Congress e-Learning, Montreal, Canada.
10. Romero, C. and Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert
Systems with Applications, 33(1), 135–146.
11. Srivastava, J., Cooley, R., Deshpande, M., and Tan, P. (2000). Web usage mining: Discovery and
applications of usage patterns from web data. SIGKDD Explorations, 1(2), 12–23.
12. Shaw, M.J., Subramanian, C., Tan, G.W., and Welge, M.E. (2001). Knowledge management and data mining for marketing. Decision Support Systems, 31(1), 127–137.
Part I

2
Visualization in Educational Environments

Riccardo Mazza
Contents
2.1 Introduction.............................................................................................................................9
2.2 What Is Information Visualization?................................................................................... 10
2.2.1 Visual Representations............................................................................................ 10
2.2.2 Interaction.................................................................................................................. 11
2.2.3 Abstract Data............................................................................................................. 11
2.2.4 Cognitive Amplification.......................................................................................... 12
2.3 Design Principles.................................................................................................................. 13
2.3.1 Spatial Clarity............................................................................................................ 14
2.3.2 Graphical Excellence................................................................................................. 14
2.4 Visualizations in Educational Software............................................................................ 16
2.4.1 Visualizations of User Models................................................................................ 16
2.4.1.1 UM/QV........................................................................................................ 16
2.4.1.2 ViSMod........................................................................................................ 17
2.4.1.3 E-KERMIT................................................................................................... 18
2.4.2 Visualizations of Online Communications.......................................................... 19
2.4.2.1 Simuligne.................................................................................................... 19
2.4.2.2 PeopleGarden............................................................................................. 20
2.4.3 Visualizations of Student-Tracking Data............................................................... 20
2.5 Conclusions............................................................................................................................ 24
References........................................................................................................................................ 25
2.1 Introduction
This chapter presents an introduction to information visualization, a new discipline with origins in the late 1980s that is part of the field of human–computer interaction. We will illustrate the purposes of this discipline, its basic concepts, and some design principles that can be applied to graphically render students' data from educational systems. The chapter starts with a description of information visualization, followed by a discussion of some design principles defined by outstanding scholars in the field. Finally, some systems in which visualizations have been used in learning environments to represent user models, discussions, and tracking data are described.
the end. This can be defined as a scrutiny task, because it is a conscious operation that involves memory, semantics, and symbolism.

Let us try to do the same operation, this time using the bars on the left. Thanks to the pre-attentive property of length, the length of the bars allows us to almost immediately identify the longest and the shortest.
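A minimal sketch of this kind of bar encoding follows (Python with matplotlib; the values are made up rather than taken from the figure discussed above). Encoding each value as a bar length lets the largest and smallest values be spotted at a glance instead of being read one by one.

```python
# Encode values as bar lengths: the longest and shortest bars can be
# identified pre-attentively, without reading the numbers themselves.
# The values below are made up for illustration.
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E", "F"]
values = [34, 78, 12, 55, 91, 47]

fig, ax = plt.subplots()
ax.barh(labels, values)  # compare glancing at bar lengths vs. scanning numbers
ax.set_xlabel("Value")
plt.show()
```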
Graphical representations are often associated with the term "visualization" (or "visualisation" in the British spelling). It has been noted by Spence [16] that there is a diversity of uses of the term "visualization." For instance, in a dictionary the following definitions can be found:

These definitions reveal that visualization is an activity in which humans engage, as an internal construct of the mind [16,20]. It is something that cannot be printed on paper or displayed on a computer screen. With these considerations, we can summarize that visualization is a cognitive activity, facilitated by graphical external representations from which people construct internal mental representations of the world [16,20].

Computers may facilitate the visualization process with visualization tools. This is especially true in recent years with the availability of powerful computers at low cost. However, the above definition is independent of computers: although computers can facilitate visualization, it remains an activity that happens in the mind.
2.2.2 Interaction

Recently there has been great progress in high-performance, affordable computer graphics. The common personal computer has reached a graphics power that just 10 years ago was possible only with very expensive graphics workstations built specifically for graphics processing. At the same time, there has been a rapid expansion in the information that people have to process for their daily activities. This need has led scientists to explore new ways to represent huge amounts of data with computers, taking advantage of the possibility for users to interact with the algorithms that create the graphical representation. Interactivity derives from people's ability to identify interesting facts when the visual display changes, and it allows them to manipulate the visualization or the underlying data to explore such changes.
* The Concise Oxford Dictionary. Ed. Judy Pearsall. Oxford University Press, 2001. Oxford Reference Online. Oxford
University Press.
† A Dictionary of Computing. Oxford University Press, 1996. Oxford Reference Online. Oxford University Press.
‡ Merriam-Webster Online Dictionary. http://www.webster.com
earth), and data that is more abstract in nature (e.g., stock market fluctuations). The former is known as scientific visualization, and the latter as IV [4,16,19].

Scientific visualization was developed in response to the needs of scientists and engineers to view experimental or phenomenal data in graphical formats (an example is given in Figure 2.2), while IV deals with more abstract, unstructured data sets [4]. Table 2.1 reports some examples of abstract data and physical data. However, we ought to say that this distinction is not strict, and sometimes abstract and physical data are combined in a single representation. For instance, the results of the last Swiss federal referendum on changing the Swiss law on asylum can be considered a sort of abstract data if the goal of the graphical representation is to highlight the preference (yes or no) with respect to the social status, age, sex, etc. of the voter. But if we want to highlight the percentage that the referendum obtained in each town, a mapping onto the geographical location might be helpful to see how the linguistic regions, the cantons, and the proximity to the border influenced the choice of the electorate (see Figure 2.3).

FIGURE 2.2
Example of scientific visualization: the ozone hole over the South Pole on September 22, 2004. (Image from the NASA Goddard Space Center archives and reproduced with permission.)
    2 7
  × 4 2
  -----
    5 4
1 0 8
-------
1 1 3 4

This example shows how the visual and manipulative use of external representations and processing amplifies cognitive performance. Graphics use visual representations that help to amplify cognition. They convey information to our minds that allows us to search for patterns, recognize relationships between data, and perform some inferences more easily. Card et al. [2] propose six major ways in which visualizations can amplify cognition.

TABLE 2.1
Some Examples of Abstract Data and Physical Data

Abstract Data        Physical Data
Names                Data gathered from instruments
Grades               Simulations of wind flow
News or stories      Geographical locations
Jobs                 Molecular structure

FIGURE 2.3
Graphical representation of the results of the federal referendum held in Switzerland on September 24, 2006 (turnout: 48.4%; share of "yes" votes: 67.8%). (Image from the Swiss Federal Statistical Office, http://www.bfs.admin.ch. © Bundesamt für Statistik, ThemaKart 2009, reproduced with permission.)
FIGURE 2.4
Map of the Madrid metro system. (Designed and drawn by Matthew McLauchlin, http://www.metrodemontreal.com/; image licensed under Creative Commons Share-Alike.)
Following these principles is the key to building what Tufte calls graphical excellence, which consists in "giving the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space" [18].

A key question in IV is how we convert abstract data into a graphical representation, preserving the underlying meaning and, at the same time, providing new insight. There is no "magic formula" that helps researchers systematically build a graphical representation starting from a raw set of data. It depends on the nature of the data, the type of information to be represented and its use, and, above all, on the creativity of the designer of the graphical representation. Some interesting ideas, even if innovative, have often failed in practice.

Graphics facilitate IV, but a number of issues must be considered [16,18]:
2.4.1.1 UM/QV

QV [6] is an overview interface for UM [5], a toolkit for cooperative user modeling. A model is structured as a hierarchy of elements of the domain. QV uses a hierarchical representation of concepts to present the user model. For instance, Figure 2.5 gives a graphical representation of a model showing concepts of the SAM text editor. It gives a quick overview of whether the user appears to know each element of the domain. QV exploits different types
FIGURE 2.5
The QV tool showing a user model. (Image courtesy of Judy Kay.)
2.4.1.2 ViSMod

ViSMod [22] is an interactive visualization tool for the representation of Bayesian learner models. In ViSMod, learners and instructors can inspect the learner model using a graphical representation of the Bayesian network. ViSMod uses concept maps to render a Bayesian student model, and various visualization techniques such as color, size, proximity, link thickness, and animation to represent concepts such as marginal probability, changes in probability, probability propagation, and cause–effect relationships (Figure 2.6). One interesting aspect of this model is that the overall belief of a student knowing a particular concept is captured taking into account the students' opinion, the instructors' opinion, and the influence of social aspects of learning on each concept. By using ViSMod, it is possible to inspect complex networks by focusing on a particular segment (e.g., zooming or scrolling) and to use animations to represent how probability propagation occurs in a simple network in which several causes affect a single node.

FIGURE 2.6
A screenshot of ViSMod showing a fragment of a Bayesian student model in the area of cell biology. (Image courtesy of Diego Zapata-Rivera.)
2.4.1.3 E-KERMIT

KERMIT (Knowledge-based Entity Relationship Modelling Intelligent Tutor) [17] is a knowledge-based intelligent tutoring system aimed at teaching conceptual database design to university-level students. KERMIT teaches basic entity-relationship (ER) database modeling by presenting the student with the requirements for a database; the student then has to design an ER diagram for it. E-KERMIT is an extension of KERMIT developed by Hartley and Mitrovic [3] with an open student model. In E-KERMIT the student may examine a global view of the student model through a dedicated interface (see Figure 2.7). The course domain is divided into categories, representing the processes and concepts in ER modeling. In the representation of the open student model, concepts of the domain are mapped to histograms. Each histogram shows how much of the covered part of the domain the student knows correctly (in black) or incorrectly (in gray), together with the percentage of the category's concepts that have been covered. For instance, the example shows that the student covered 32% of the concepts of the category Type, and has scored 23% out of a possible 32% on this category. This means that the student's performance on the category Type so far is about 72% (23/32 × 100).
FIGURE 2.7
The main view of a student's progress in E-KERMIT. Progress bars indicate how much the student comprehends each category of the domain. (Image reproduced with permission of Tanja Mitrovic.)
2.4.2.1 Simuligne

Simuligne [12] is a research project that uses social network analysis [14] to monitor group communications in distance learning, in order to help instructors detect collaboration problems or a slowdown of group interactions. Social network analysis is a research field that aims to "characterize the group's structure and, in particular, the influence of each of the members on that group, reasoning on the relationship that can be observed in that group" (Reffay and Chanier [12], p. 343). It provides both a graphical and a mathematical analysis of interactions between individuals. The graphical version can be represented with a network, where the nodes are the individuals and groups, while the links show relationships or flows between the nodes. Social network analysis can help to determine the prominence of a student with respect to others, as well as other social network measures, such as the cohesion factor between students. Cohesion is a statistical measure that represents how much the individuals socialize in a group that shares goals and values. Reffay and Chanier applied this theory to the e-mails exchanged in a class of distance learners. Figure 2.8 illustrates the graphical representation of the e-mail graph for each learning group. We can see, for instance, that there is no communication with Gl2 and Gl3, and the central role of the tutor in the discussions (node Gt).
FIGURE 2.8
The communication graph of the e-mail exchanged within groups in Simuligne. (Image courtesy of Christophe
Reffay.)
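To give a rough sense of how such measures can be computed, the sketch below (Python with the networkx library) builds a small directed e-mail graph and derives two common social network measures. The node names echo the group labels mentioned above, but the message counts are invented, and the measures shown (degree centrality and density) are generic indicators rather than the specific cohesion measure of Reffay and Chanier.

```python
# Generic social-network-analysis sketch for an e-mail exchange graph.
# Nodes are group members, edge weights are numbers of messages sent.
# The counts are invented; this is not the Simuligne data.
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("Gt", "Gl1", 28), ("Gl1", "Gt", 25),   # tutor <-> learner exchanges
    ("Gt", "Gl4", 20), ("Gl4", "Gt", 17),
    ("Gl1", "Gl4", 4), ("Gl4", "Gl6", 3),
])

# Degree centrality highlights prominent members (here, the tutor Gt)
print(nx.degree_centrality(G))

# Density is one simple indicator of how interconnected the group is
print("density:", nx.density(G))
```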
2.4.2.2 PeopleGarden

PeopleGarden [21] uses a flower-and-garden metaphor to visualize participation on a message board. The message board is visualized as a garden full of flowers, where each flower represents one individual. The height of a flower denotes the amount of time a user has been at the board, and its petals denote his or her postings. Initial postings are shown in red, replies in blue. An example is represented in Figure 2.9. The figure can help the instructor of a course to quickly grasp the underlying situation, such as a single dominant member in the discussion on the left, or a group with many members at different levels of participation on the right.
FIGURE 2.9
The PeopleGarden visual representations of participation on a message board. (Image by Rebecca Xiong and
Judith Donath, © 1999 MIT media lab.)
to the instructor of the course, but it is commonly presented in the format of a textual log file, which is inappropriate for the instructor's needs [8]. To this end, since the log data is collected in a format that is suitable for analysis with IV techniques and tools, a number of approaches have been proposed to graphically represent the tracking data generated by a CMS.

Recently, a number of research projects that exploit graphical representations to analyze student tracking data have been proposed. ViSION [15] is a tool that was implemented to display student interactions with a courseware website designed to assist students with their group project work. CourseVis [10] is another application that exploits graphical representations to analyze student tracking data. CourseVis is a visual student tracking tool that transforms tracking data from a CMS into graphical representations that can be explored and manipulated by course instructors to examine social, cognitive, and behavioral aspects of distance students. CourseVis started from a systematic investigation aimed at finding out what information about distance students instructors need when they run courses with a CMS, as well as at identifying possible ways to help instructors acquire this information. This investigation was conducted with a survey, and the results were used to derive the requirements and to inform the design of the graphical representations. One of the (several) graphical representations produced by CourseVis is reported in Figure 2.10.
This comprehensive image represents in a single view the behaviors of a specific student in an online course. It takes advantage of the single-axis composition method (multiple variables share an axis and are aligned using that axis) for presenting a large number of variables in a 2D metric space. With a common x-axis mapping the dates of the course, a number of variables are represented: the student's accesses to the content pages (ordered by the topics of the course), the global accesses to the course (content pages, quizzes, discussions, etc.), progress with the schedule of the course, messages (posted, read, follow-ups), and the submission of quizzes and assignments. For a detailed description of CourseVis, see Mazza and Dimitrova [10].
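The single-axis composition idea can be sketched as follows (Python with matplotlib and randomly generated data; this illustrates the layout principle, not the CourseVis implementation): several variables are drawn in stacked panels that share the same date axis, so events in different variables can be related by their horizontal position.

```python
# Sketch of single-axis composition: several variables share a common
# date axis so that they can be compared by horizontal position.
# All data below is randomly generated for illustration.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.date_range("2002-01-15", "2002-04-11", freq="D")
rng = np.random.default_rng(0)
accesses = rng.poisson(2.0, len(dates))        # accesses to content pages
messages = rng.poisson(0.5, len(dates))        # messages posted
quiz_days = rng.random(len(dates)) < 0.05      # days with a quiz submission

fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 6))
axes[0].plot(dates, accesses)
axes[0].set_ylabel("Accesses")
axes[1].bar(dates, messages)
axes[1].set_ylabel("Messages")
axes[2].scatter(dates[quiz_days], np.ones(quiz_days.sum()), marker="|")
axes[2].set_ylabel("Quizzes")
axes[2].set_yticks([])
axes[2].set_xlabel("Date of the course")
plt.show()
```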
FIGURE 2.10
A CourseVis view summarizing the behaviors of an individual student (student: Francesco) from 2002-01-15 to 2002-04-11, including accesses to content pages ordered by the topics of a Java programming course, together with the other variables described above, aligned on a common time axis.
FIGURE 2.11
A graph reporting an overview of students' accesses to the course in GISMO.
The produced visualizations were evaluated with an empirical study [8], which showed that the graphical representations can help instructors to identify individuals who need particular attention, discover patterns and trends in accesses and discussions, and reflect on their teaching practice. However, the study also revealed some limitations, such as the adoption of 3D graphics in one of the graphical representations (which was considered too problematic for the typical users of the system), and the lack of full integration with the CMS.
For these reasons, the ideas behind CourseVis were applied in a plug-in for the open source CMS Moodle called GISMO [11]. GISMO is developed as a plug-in module fully integrated with the CMS and is available for download at http://gismo.sourceforge.net. Figure 2.11 shows the students' accesses to the course. A simple matrix formed by students' names (on the Y-axis) and the dates of the course (on the X-axis) is used to represent the course accesses. Each blue square represents at least one access to the course made by the student on the selected date. The histogram at the bottom shows the global number of hits to the course made by all students on each date. The instructor thus has an overview of the global student access to the course, with a clear identification of patterns and trends.
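A rough sketch of this kind of matrix-plus-histogram view is given below (Python with matplotlib and randomly generated data; it illustrates the layout, not the actual GISMO code). A boolean student-by-date matrix is drawn as an image, and its column sums give the daily totals shown underneath. The view of Figure 2.12, described next, could be sketched the same way by replacing the boolean matrix with per-resource access counts so that the colormap encodes intensity.

```python
# Sketch of an access-matrix view: rows are students, columns are course
# dates, and a filled cell means at least one access on that date; the bar
# chart underneath shows the total number of accesses per date.
# The data is randomly generated; this is not the GISMO implementation.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n_students, n_days = 20, 60
accessed = rng.random((n_students, n_days)) < 0.3   # True = at least one access

fig, (ax_matrix, ax_hist) = plt.subplots(
    2, 1, sharex=True, figsize=(8, 6),
    gridspec_kw={"height_ratios": [3, 1]})

ax_matrix.imshow(accessed, aspect="auto", cmap="Blues", interpolation="nearest")
ax_matrix.set_ylabel("Students")

ax_hist.bar(np.arange(n_days), accessed.sum(axis=0))
ax_hist.set_ylabel("Accesses")
ax_hist.set_xlabel("Day of the course")
plt.show()
```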
Figure 2.12 reports an overview of students' accesses to the resources of the course, with student names on the Y-axis and resource names on the X-axis. A mark is depicted if the student accessed the resource, and the color of the mark ranges from light to dark according to the number of times he or she accessed it. With this picture, the instructor has an overview of the students' accesses to the course pages, with a clear identification of the most (and least) accessed resources. GISMO is described in detail in Mazza and Milani [9] and Mazza and Botturi [11].
FIGURE 2.12
A graph reporting an overview of students’ accesses to resources of the course.
2.5 Conclusions
Online educational systems collect a vast amount of information that is valuable for analyzing students' behavior. However, due to the vast amount of data these systems generate, it is very difficult to manage it manually [13]. In the last few years, researchers have begun to investigate various methods for extracting valuable information from this huge amount of data that might help instructors to manage their courses. Data mining and IV are two approaches that might help to uncover new, interesting, and useful knowledge based on students' usage data [13].

In this chapter, we presented some ideas and principles for using IV techniques to graphically render students' data collected by web-based educational systems. Thanks to our visual perceptual abilities, graphical representations can be very useful for quickly discovering patterns, regularities, and trends in data, and for providing a useful overview of a whole dataset. We described some systems in which visualizations have been used in learning environments to represent user models, discussions, and tracking data.
It is recognized that data mining algorithms alone are not enough, just as it is recognized that graphical displays alone are not an effective solution to the analysis of complex and large data sets. The two approaches, data mining and IV, could be combined to build something that is greater than the sum of the parts. Future research should consider these two approaches and merge them toward a unified research stream.
References
1. Bertin, J. (1981). Graphics and Graphic Information Processing. Walter de Gruyter, Berlin, Germany.
2. Card K. S., Mackinlay J. D., and Shneiderman B. (1999). Readings in Information Visualization,
Using Vision to Think. Morgan Kaufmann, San Francisco, CA.
3. Hartley, D. and Mitrovic, A. (2002). Supporting learning by opening the student model. In
Intelligent Tutoring Systems, Proceedings of the 6th International Conference, ITS 2002, Biarritz,
France and San Sebastian, Spain, June 2–7, 2002, volume 2363 of Lecture Notes in Computer
Science, pp. 453–462. Springer, Berlin, Germany.
4. Hermann, I., Melancon, G., and Marshall, M. S. (2000). Graph visualisation and navigation in information visualisation. IEEE Transactions on Visualization and Computer Graphics, 1(6):24–43. http://www.cwi.nl/InfoVisu/Survey/StarGraphVisuInInfoVis.html
5. Kay, J. (1995). The um toolkit for cooperative user modelling. User Modeling and User Adapted
Interaction, 4:149–196.
6. Kay, J. (1999). A scrutable user modelling shell for user-adapted interaction. PhD thesis, Basser
Department of Computer Science, University of Sydney, Sydney, Australia.
7. Larkin, J. H. and Simon, H. A. (1987). Why a diagram is (sometimes) worth ten thousand words.
In Glasgow, J., Narayahan, H., and Chandrasekaram, B., editors, Diagrammatic Reasoning—
Cognitive and Computational Perspectives, pp. 69–109. Cambridge, MA: AAAI Press/The MIT
Press. 1995. Reprinted from Cognitive Science, 11:65–100, 1987.
8. Mazza, R. (2004). Using information visualization to facilitate instructors in web-based distance
learning. PhD thesis dissertation, Faculty of Communication Sciences, University of Lugano,
Lugano, Switzerland.
9. Mazza, R. and Milani, C. (2005). Exploring usage analysis in learning systems: Gaining insights
from visualisations. In Workshop on Usage Analysis in Learning Systems. 12th International Conference
on Artificial Intelligence in Education (AIED 2005), pp. 65–72. Amsterdam, the Netherlands, July 18,
2005.
10. Mazza, R. and Dimitrova, V. (2007). CourseVis: A graphical student monitoring tool for facilitat-
ing instructors in web-based distance courses. International Journal in Human-Computer Studies
(IJHCS), 65(2):125–139. Elsevier Ltd.
11. Mazza, R. and Botturi, L. (2007). Monitoring an online course with the GISMO tool: A case
study. Journal of Interactive Learning Research, 18(2):251–265. Chesapeake, VA: AACE.
12. Reffay, C. and Chanier, T. (2003). How social network analysis can help to measure cohesion in collaborative distance learning? In Proceedings of the Computer Supported Collaborative Learning Conference (CSCL'03), pp. 343–352. Kluwer Academic Publishers, Bergen, Norway.
13. Romero, C. et al. (2008). Data mining in course management systems: Moodle case study and
tutorial. Computers & Education, 51(1):368–384.
14. Scott, J. (1991). Social Network Analysis. A Handbook. SAGE Publication, London, U.K.
15. Sheard, J., Albrecht, D., and Butbul, E. (2005) ViSION: Visualizing student interactions online.
In The Eleventh Australasian World Wide Web Conference (AusWeb05). Queensland, Australia, July
2–6, 2005.
16. Spence R. (2001). Information Visualisation. Addison-Wesley, Harlow, U.K., 2001.
17. Suraweera, P. and Mitrovic, A. (2002). Kermit: A constraint-based tutor for database modeling.
In ITS 2002, Proceedings of 6th International Conference, volume 2363 of Lecture Notes in Computer
Science, pp. 377–387. Springer-Verlag, Berlin, Germany.
18. Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
19. Uther, J. (2001). On the visualisation of large user models in web based systems. PhD thesis,
School of Information Technologies, University of Sydney, Sydney, Australia.
20. Ware, C. (1999). Information Visualization. Perception for Design. Morgan Kaufmann Series in
Interactive Technologies. Morgan Kaufmann, San Francisco, CA.
21. Xiong, R. and Donath, J. (1999). Peoplegarden: Creating data portraits for users. In Proceedings
of the 12th Annual ACM Symposium on User Interface Software and Technology. November 7–10,
1999, Asheville, NC.
22. Zapata-Rivera, J. D. and Greer, J. E. (2001). Externalising learner modelling representations.
In Workshop on External Representations in AIED: Multiple Forms and Multiple Roles, pp. 71–76.
Held with the AI-ED 2001—10th International Conference on Artificial Intelligence in Education,
San Antonio, TX, May 19–23, 2001.
23. Mazza, R. and Dimitrova, V. (2005). Generation of graphical representations of student track-
ing data in course management systems. In Proceedings of the 9th International Conference on
Information Visualisation, London, U.K., July 6–8, 2005.
3
Basics of Statistical Analysis of Interactions
Data from Web-Based Learning Environments
Judy Sheard
Contents
3.1 Introduction........................................................................................................................... 28
3.2 Studies of Statistical Analysis of Web Log Files............................................................... 29
3.3 Web Log Files........................................................................................................................ 30
3.3.1 Log File Data.............................................................................................................. 30
3.3.2 Log File Data Abstractions...................................................................................... 31
3.4 Preprocessing Log Data....................................................................................................... 32
3.4.1 Data Cleaning............................................................................................................ 32
3.4.1.1 Removal of Irrelevant Data....................................................................... 32
3.4.1.2 Determining Missing Entries................................................................... 32
3.4.1.3 Removal of Outliers................................................................................... 32
3.4.2 Data Transformation................................................................................................ 33
3.4.3 Session Identification................................................................................................ 33
3.4.4 User Identification..................................................................................................... 33
3.4.5 Data Integration........................................................................................................34
3.5 Statistical Analysis of Log File Data...................................................................................34
3.5.1 Descriptive Statistics................................................................................................ 35
3.5.1.1 Describing Distributions........................................................................... 35
3.5.1.2 Relationships between Variables............................................................. 36
3.5.1.3 Graphical Descriptions of Data................................................................ 37
3.5.2 Inferential Statistics.................................................................................................. 37
3.5.2.1 Testing for Differences between Distributions Using
Parametric Tests......................................................................................... 37
3.5.2.2 Testing for Differences between Distributions Using
Nonparametric Tests.................................................................................. 38
3.5.2.3 Testing for Relationships........................................................................... 39
3.6 Conclusions............................................................................................................................ 39
References........................................................................................................................................ 40
3.1 Introduction
With the enormous growth in the use of Web-based learning environments, there is a need
for feedback and evaluation methods that are suitable for these environments. Traditional
methods used to monitor and assess learning behavior are not necessarily effective or
appropriate for an electronic learning environment. However, Web technologies have
enabled a new way of collecting data about learning behavior. The automatic capture and
recording of information about student interactions using computer software processes
provides a rich source of data. In the 1980s, Zuboff [1], a social scientist, proposed that
computer-generated interactions could be used to provide information about online learn-
ing processes. Zuboff recognized that as well as automating tasks, computers also generate
information about the processing of those tasks. She termed this process informating and
predicted that the capacity for a machine to “informate” about its actions would ultimately
reconfigure the nature of work. Early research studies of computer interactions captured
keystroke type data, which provided information about general technology usage. The
focus of studies of computer interactions has now shifted from considering the technology
in isolation to understanding the user in relationship to the technology.
Data of student interactions with a Web-based learning environment may be stored on
a Web server log as part of normal server activity or may be captured by software pro-
cesses. A common technique used is instrumentation in which additional code is added to
webpage scripts to capture details of interactions [2]. The continuous collection of interac-
tions data enables longitudinal views of learning behavior rather than just snapshots or
summaries, which are provided by other data collection techniques. Another advantage of
capturing interactions data is that it records what students have actually done rather than
relying on their recollections or perceptions, thus eliminating problems of selective recall
or inaccurate estimation [3,4]. Collecting data on log files is also an efficient technique
for gathering data in the learning situation. This technique is transparent to the student
and does not affect the learning process [5]. Log file data collection and recording fits the
paradigm of naturalistic observation and enquiry [6]. However, monitoring use of a learn-
ing environment via log files raises the issue of privacy, particularly in cases where the
students can be identified.
The main limitations with collecting data of student interactions lie in the management
and interpretation of the data. Recording the online interactions of a group of students
over the term of their course generates a huge volume of data. Although this data contains
records of students' learning behavior, it is difficult to deduce information beyond what
students have done, when it was done, and how long it took. Combining the interactions data
with other data such as learning outcomes can help provide a fuller picture [7,8].
These issues have provided impetus for the development of methods to manage and
interpret student interactions data. These methods have drawn mainly from the data min-
ing domain. The application of data mining techniques to Web log file data has led to the
development of a field called Web usage mining, which is the study of user behavior when
navigating on the Web. Mobasher [9] states that the “goal of Web usage mining is to cap-
ture and model the behavioral patterns and profiles of users interacting with a Web site”
(p. 3). In some respects, Web usage mining is the process of reconciling the Web site devel-
oper’s view of how the Web site should be used with the way it is actually used [10]. Web
usage mining typically involves preprocessing, exploration, and analysis phases. The data
can be collected from Web log files, user profiles, and any other data from user interactions
[11]. This typically forms a large and complex data set.
whereas females used passive resources more than males. Another study by Comunale
et al. [22] found that females perceived that the courseware Web site contributed more to
their learning.
An important aspect of an investigation of student usage of a courseware Web site is
determining how the usage relates to task completion or learning outcomes. The study
by Comunale et al. [22] found evidence to suggest that higher course grades are related to
more frequent Web site use. Other examples of studies of the relationship between Web
site usage and learning are by Feng et al. [23], Gao and Lehman [18], and Zhu et al. [24].
working in and may be used to determine the type of action that the user had
performed.
• Interaction time. This specifies the date and time of the interaction. This informa-
tion may be used to calculate the time lapses between different interactions.
Client side logging makes possible the recording of additional data, enabling more sophis-
ticated and detailed analysis of Web site usage and user behavior. The data recorded will
depend on the application being monitored and the process used to record the data. The
following data are useful to record:
• Session identification. This identifies the user session in which the interaction
occurred. Identification of a session allows the interactions to be associated with a
single session and a user who may or may not be identified.
• User identification. This identifies the user who initiated the interaction.
Identification of users allows comparisons to be made between users based on
available demographic information.
Other data may be recorded that provide additional information about the interaction or
the user. These may be obtained from other sources and will be specific to each application.
1. Page view—The lowest and most common level of data abstraction. A page view
is equivalent to a single interaction stored on the log file and results from a single
user request. This is suitable for fine-grained analysis of Web site usage.
2. Session—A sequence of interactions representing a user visit to the Web site. This
can be seen as a temporal abstraction of the log file data and is suitable for a coarse-
grained analysis of Web site usage. Note that this definition of a session is similar
to the definition of a server session in the Web Characterization Terminology &
Definitions Sheet [25].
3. Task—A sequence of contiguous page accesses to one resource on a Web site within
a user session. A resource is a page or group of linked pages that is identified logi-
cally as a function or facility within the Web site domain. This is more specific
than the general definition of resource as specified in the Web Characterization
Terminology & Definitions Sheet, which defines a resource as: “anything that has
identity” [25]. This abstraction can be seen as a resource abstraction.
4. Activity—A sequence of semantically meaningful page accesses that relate to a
particular activity or task that the user performs. This is, in effect, an abstraction
of a discrete behavioral activity of the user.
Mobasher [9] describes three levels of abstraction that are typically used for analysis of
log file data: page view, session, and transaction. The page view and session abstractions
described above are the same as those defined by Mobasher [9] and are explicitly defined
in the Web domain. The activity abstraction relates to the transaction abstraction described
by Mobasher.
those that should be rejected is not always easy. This requires knowledge of the domain in
which the data was collected and depends on the aims of the analysis [30].
In the context of log file data, it may be important to identify and exclude outliers from
analysis. For example, log files may contain long periods between successive interactions
within sequences of user interactions. This typically occurs because the user pauses from
their work for some reason. The inclusion of sequences containing long pauses can give
misleading results when calculating time spent at a Web site and it is important to filter
out these sequences before analysis. Claypool et al. [21] used a time limit between succes-
sive interactions of approximately 20 min to filter out sessions with these pauses, arguing
that after this time it was likely that a user had stopped reading the page. Sheard [2] used
a limit of 10 min based on an empirical study of access times. As another example of outli-
ers, log files may contain sessions with unusually long streams of interactions. Giudici [31]
filtered out the 99th percentile of the distributions of both sequence length and session
times based on results of an exploratory study.
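The sketch below illustrates this kind of pause filtering with pandas; the data, the column names, and the 10 min threshold (after Sheard [2]) are placeholders rather than a prescribed recipe.

```python
import pandas as pd

# A made-up fragment of a cleaned log: one user's timestamped page requests.
log = pd.DataFrame({
    "user_id": ["u1"] * 6,
    "timestamp": pd.to_datetime([
        "2009-03-02 10:00", "2009-03-02 10:03", "2009-03-02 10:07",
        "2009-03-02 10:45",                     # 38 min pause -> new session
        "2009-03-02 10:47", "2009-03-02 10:52",
    ]),
}).sort_values(["user_id", "timestamp"])

THRESHOLD = pd.Timedelta(minutes=10)            # pause limit, after Sheard [2]

# Time since the user's previous interaction; a gap above the threshold
# starts a new session, so long pauses never count towards time on site.
gap = log.groupby("user_id")["timestamp"].diff()
log["session_id"] = (gap.isna() | (gap > THRESHOLD)).groupby(log["user_id"]).cumsum()

# Time spent per session (the duration of the final page view is unknown).
session_time = (log.groupby(["user_id", "session_id"])["timestamp"]
                   .agg(lambda t: t.max() - t.min()))
print(session_time)
```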
used. However, this is problematic for several reasons. A user may access a Web site from
different computers, or several users may access a Web site from the one computer [28].
The use of proxy servers causes problems with user identification as all requests have the
same identifier even when they originate from different users. At best, an IP address will
provide a rough way to distinguish between users but not to identify them. Cooley et al.
[10] suggest several heuristics which can be used to help distinguish between users within
sequences of interactions. These include looking for a change in browser or operating sys-
tem software or looking for impossible navigation steps.
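A very rough sketch of one such heuristic follows; the column names and values are invented, and the approach only distinguishes candidate users, it does not identify them.

```python
import pandas as pd

# Made-up log entries: one IP with two different browsers suggests two users.
log = pd.DataFrame({
    "ip_address": ["130.194.1.1", "130.194.1.1", "130.194.1.1", "59.167.2.2"],
    "user_agent": ["Firefox/3.0 (Windows)", "Firefox/3.0 (Windows)",
                   "Safari/4.0 (Mac OS X)", "Firefox/3.0 (Windows)"],
})

# Heuristic in the spirit of Cooley et al. [10]: treat each distinct
# (IP address, user agent) pair as a candidate user. Proxies and shared
# machines still break this, so it is only a rough partition.
log["candidate_user"] = log["ip_address"] + " | " + log["user_agent"]
print(log["candidate_user"].nunique(), "candidate users")
```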
These levels form a hierarchy of increasing information held in variables at each level of
measurement. For each level there are different descriptive and statistical techniques that
can be applied to the data.
We will now give an overview of the different statistical techniques that can be used on
data from Web log files. All the tests presented for interval scale data will also apply to
ratio scale data, therefore the discussion will only mention interval scale and the applica-
tion to ratio scale data will be implied.
• Median—The point on the scale below which half of the values lie. This is used for
ordinal data and for interval data which has an asymmetrical distribution or the
distribution contains outlying points.
• Mode—The value with the most frequent occurrence. This is typically used for
nominal data.
3.5.1.1.2 Dispersion
The measure of dispersion indicates the degree of spread of the distribution. This is only
used for ordinal and interval scale data. There are several measures which may be used:
• Range: The difference between the maximum and minimum values on the scale
of measurement. This gives a measure of the spread of values but no indication of
how they are distributed.
• Interquartile range: The data values are divided into four equal sections and the
interquartile range is the range of the middle half of the data set. This is a useful
measure if the data set contains outlying values.
• Variance: The average of the squared deviations from the mean of the distribution,
where the deviation is the difference between the mean of the distribution and a
value in the distribution. This is used for interval scale data.
• Standard deviation: The positive square root of the variance. This is used for inter-
val scale data.
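As a small illustration of these central tendency and dispersion measures, the following sketch uses invented session times and resource types; any numerical library would do equally well.

```python
import numpy as np
from collections import Counter

# Hypothetical session times in minutes (interval scale, with one outlier).
times = np.array([3.2, 4.1, 4.8, 5.0, 5.5, 6.1, 7.4, 9.0, 25.0])

print("median:", np.median(times))
print("range:", times.max() - times.min())
q1, q3 = np.percentile(times, [25, 75])
print("interquartile range:", q3 - q1)          # robust to the outlier 25.0
print("variance:", times.var(ddof=1))           # sample variance
print("standard deviation:", times.std(ddof=1))

# Mode of a nominal variable, e.g., the type of resource accessed.
resources = ["content", "forum", "content", "quiz", "content"]
print("mode:", Counter(resources).most_common(1)[0][0])
```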
3.5.1.1.3 Shape
The shape of the distribution is important for determining the type of analysis that can be
performed on the data. An important shape for interval scale data is the normal distribu-
tion or “bell curve.” Normal distributions are special distributions with certain character-
istics, which are required for tests on interval scale variables. Other relevant descriptions
of the shape of a distribution are as follows:
then r2 equals 0.57, which means that 57% of the total variation in exam performance can
be explained by the linear relationship between exam performance and interactions with
the study material.
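For example, such a coefficient could be obtained as sketched below; the arrays are invented, and scipy.stats.pearsonr simply returns r, which is squared to give the proportion of variance explained.

```python
from scipy import stats

# Hypothetical data: interactions with study material and exam marks.
interactions = [12, 30, 45, 60, 75, 90, 110, 130]
exam_marks   = [40, 48, 55, 52, 64, 70, 78, 85]

r, p_value = stats.pearsonr(interactions, exam_marks)
print(f"r = {r:.2f}, r^2 = {r**2:.2f}, p = {p_value:.3f}")
# r^2 is the proportion of variation in exam performance explained by its
# linear relationship with study-material interactions.
```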
• Parametric analyses: Used for data on the interval scale which meet specific
assumptions.
• Nonparametric analyses: Typically used for nominal or ordinal scale variables.
In analysis of log file data, we are often interested in testing for differences between
variables, for example, “Is there any difference in the number of accesses to learning
resources between the male and female students?” We may also be interested in relation-
ships between variables. We will now give an overview of analysis for testing for differ-
ences and testing for relationships.
TABLE 3.1
Parametric Tests for Difference

t-Test (independent-samples)
Hypothesis tested: There is no difference between the mean values of two distributions.
Example: There is no difference in the access times for course materials between the
on-campus and distance education students.

t-Test (paired-samples)
Hypothesis tested: There is no difference between the means of two distributions of related
variables. The paired samples may be formed when one group of students has been tested
twice or when students are paired on similar characteristics.
Example: There is no difference in the mid-semester and final exam results for a class of
students.

ANOVA (one-way)
Hypothesis tested: There is no difference between the mean values of two or more
distributions. Only one factor (independent variable) is tested.
Example: There is no difference in mean weekly online access times between students
across five degree programs.

ANOVA (two-way)
Hypothesis tested: There is no difference between the mean values of two or more
distributions. In this case, two factors (A and B) are included in the analysis. The possible
null hypotheses are:
1. There is no difference in the means of factor A.
2. There is no difference in the means of factor B.
3. There is no interaction between factors A and B.
Examples:
1. There is no difference in mean weekly online access times between students across
five degree programs.
2. There is no difference in mean weekly online access times between male and female
students in these degree programs.
3. The weekly access times of male and female students are not influenced by program
of study.
The data and distributions we are testing should meet the following assumptions:
• Data values are independent. This does not apply to paired-samples t-tests.
• Data values are selected from a normally distributed population. A normal distri-
bution will have one peak and low absolute values for skewness and kurtosis. A
test that may be used to test for normality is the Shapiro–Wilk test [41,42].
• Homogeneity of variance: The variance across two or more distributions is equal.
This may be tested with Levene’s test [41,42]. See Feng and Heffernan [46] for examples
of parametric tests.
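A sketch of checking these assumptions and then running an independent-samples t-test with scipy is given below; the two groups and their values are hypothetical.

```python
from scipy import stats

# Hypothetical weekly access times (hours) for two groups of students.
on_campus = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.8, 3.0]
distance  = [4.2, 5.1, 3.9, 4.8, 5.5, 4.1, 4.9, 5.0]

# Normality of each distribution (Shapiro-Wilk test).
print(stats.shapiro(on_campus))
print(stats.shapiro(distance))

# Homogeneity of variance (Levene's test).
print(stats.levene(on_campus, distance))

# If the assumptions hold, test for a difference in mean access times.
t, p = stats.ttest_ind(on_campus, distance)
print(f"t = {t:.2f}, p = {p:.3f}")
```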
TABLE 3.2
Nonparametric Tests for Difference

χ2 test for independence
Hypothesis tested: Two variables are independent.

Mann–Whitney U
Hypothesis tested: There is no difference between two distributions. Mann–Whitney U is
the nonparametric equivalent of the independent-samples t-test.

Wilcoxon matched-pairs signed rank test
Hypothesis tested: There is no difference between two distributions of related samples.
This is the nonparametric equivalent of the paired-samples t-test.

Kruskal–Wallis
Hypothesis tested: There is no difference between the values from three or more
distributions. This is the nonparametric equivalent of the one-way ANOVA.
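The scipy counterparts of these tests can be sketched as follows, again with invented samples.

```python
from scipy import stats

group_a = [5, 7, 9, 12, 14, 21]      # e.g., weekly page views, group A
group_b = [4, 6, 6, 8, 10, 11]       # group B
group_c = [9, 13, 15, 18, 20, 25]    # group C
mid_sem = [55, 60, 48, 72, 66]       # mid-semester results
final   = [58, 64, 50, 70, 75]       # final results for the same students

print(stats.mannwhitneyu(group_a, group_b))      # two independent samples
print(stats.wilcoxon(mid_sem, final))            # two related samples
print(stats.kruskal(group_a, group_b, group_c))  # three or more samples
```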
TABLE 3.3
Correlation Tests

Both distributions on the interval scale: Pearson product-moment
Both distributions on the ordinal scale: Spearman rank order
Both distributions on the nominal scale: Contingency coefficient
In testing for differences in distributions of nominal variables, we are testing that the
relative proportions of values of one variable are independent from the relative propor-
tions of values of another variable. For example, “Are the proportions of accesses to each
resource the same for each day of the week?”
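For a question of that kind, the χ2 test for independence can be applied to a contingency table of access counts; the counts below are invented.

```python
import numpy as np
from scipy import stats

# Rows: resources; columns: days of the week (hypothetical access counts).
counts = np.array([
    [120,  95, 100,  90, 140],   # lecture notes
    [ 60,  70,  65,  80,  40],   # discussion forum
    [ 30,  25,  20,  35,  90],   # past exam papers
])

chi2, p, dof, expected = stats.chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
# A small p-value suggests the proportions of accesses differ across days.
```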
3.6 Conclusions
This chapter has outlined common techniques for descriptive and inferential statistical
analysis, which may be applied to log file data. Also explained are processes necessary
for preparing the log file data for analysis using standard statistical software. Statistical
analysis of log file data of student interactions can be used to provide information about
Web site usage, resource usage, learning behavior, and task performance.
References
1. Zuboff, S., In the Age of the Smart Machine: The Future of Work and Power. Basic Books, New York,
1984, 468 pp.
2. Sheard, J., An Investigation of Student Behaviour in Web-Based Learning Environments, in Faculty of
Information Technology. PhD thesis, Monash University, Melbourne, Australia, 2006.
3. Nachmias, R. and Segev, L., Students’ use of content in Web-supported academic courses. The
Internet and Higher Education 6: 145–157, 2003.
4. Yi, M.Y. and Hwang, Y., Predicting the use of Web-based information systems: Efficacy, enjoy-
ment, learning goal orientation, and the technology acceptance model. International Journal of
Human-Computer Studies 59: 431–449, 2003.
5. Federico, P.-A., Hypermedia environments and adaptive instruction. Computers in Human
Behavior 15(6): 653–692, 1999.
6. Sarantakos, S., Social Research, 2nd edn. MacMillan Publishers Australia Pty Ltd., Melbourne,
Australia, 1998.
7. Ingram, A.L., Using Web server logs in evaluating instructional Web sites. Journal of Educational
Technology Systems 28(2): 137–157, 1999–2000.
8. Peled, A. and Rashty, D., Logging for success: Advancing the use of WWW logs to improve
computer mediated distance learning. Journal of Educational Computing Research 21(4): 413–431,
1999.
9. Mobasher, B., Web Usage Mining and Personalization, in Practical Handbook of Internet Computing,
M.P. Singh, Editor. CRC Press, Boca Raton, FL, 2004, pp. 1–35.
10. Cooley, R., Mobasher, B., and Srivastava, J. Web mining: Information and pattern discovery
on the World Wide Web. In Proceedings of 9th IEEE International Conference Tools and Artificial
Intelligence (ICTAI’97), Newport Beach, CA, November 1997, pp. 558–567.
11. Kosala, R. and Blockeel, H., Web mining research: A survey. ACM SIGKDD Explorations 2(1):
1–15, 2000.
12. Romero, C. and Ventura, S., Educational data mining: A survey from 1995 to 2005. Expert
Systems with Applications 33: 135–146, 2007.
13. Zaïane, O.R., Xin, M., and Han, J., Discovering Web access patterns and trends by applying
OLAP and data mining technology on web logs. In Proceedings of Advances in Digital Libraries,
Santa Barbara, CA, April 1998, pp. 19–29.
14. Avouris, N., Komis, V., Fiotakis, G., Margaritis, M., and Voyiatzaki, E., Logging of fingertip
actions is not enough for analysis of learning activities. In Proceedings of 12th International
Conference on Artificial Intelligence in Education, Amsterdam, the Netherlands, July 2005,
pp. 1–8.
15. Cockburn, A. and McKenzie, B., What do Web users do? An empirical analysis of Web use.
International Journal of Human-Computer Studies 54(6): 903–922, 2001.
16. Hwang, W.-Y. and Li, C.-C., What the user log shows based on learning time distribution.
Journal of Computer Assisted Learning 18: 232–236, 2002.
17. Nilakant, K. and Mitrovic, A., Application of data mining in constraint-based intelligent tutor-
ing systems. In Proceedings of Artificial Intelligence in Education. Amsterdam, the Netherlands,
July 2005, pp. 896–898.
18. Gao, T. and Lehman, J.D., The effects of different levels of interaction on the achievement and
motivational perceptions of college students in a Web-based learning environment. Journal of
Interactive Learning Research 14(4): 367–386, 2003.
19. Monk, D., Using data mining for e-learning decision making. Electronic Journal of e-Learning
3(1): 41–45, 2005.
20. Grob, H.L., Bensberg, F., and Kaderali, F., Controlling open source intermediaries—A Web log
mining approach. In Proceedings of International Conference on Information Technology Interfaces,
Zagreb, Croatia, 2004, pp. 233–242.
21. Claypool, M., Brown, D., Le, P., and Waseda, M., Inferring user interest. IEEE Internet Computing
5(6): 32–39, 2001.
22. Comunale, C.L., Sexton, T.R., and Voss, D.J.P., The effectiveness of course Web sites in higher
education: An exploratory study. Journal of Educational Technology Systems 30(2): 171–190,
2001–2002.
23. Feng, M., Heffernan, N., and Koedinger, K., Looking for sources of error in predicting student’s
knowledge. In Proceedings of AAAI Workshop on Educational Data Mining, Menlo Park, CA, 2005,
pp. 1–8.
24. Zhu, J.J.H., Stokes, M., and Lu, A.X.Y., The use and effects of Web-based instruction: Evidence
from a single-source study. Journal of Interactive Learning Research 11(2): 197–218, 2000.
25. W3C. Web Characterization Terminology & Definitions Sheet. 1999 [cited December 29, 2009].
Available from: http://www.w3.org/1999/05/WCA-terms/
26. Cooley, R., Mobasher, B., and Srivastava, J., Data preparation for mining World Wide Web
browsing patterns. Knowledge and Information Systems 1(1): 1–27, 1999.
27. Joshi, K.P., Joshi, A., Yesha, Y., and Krishnapuram, R., Warehousing and mining Web logs. In
Proceedings of ACM CIKM’99 2nd Workshop on Web Information and Data Management (WIDM‘99)
Conference, Kansas City, MO, 1999, pp. 63–68.
28. Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N., Web usage mining: Discovery and
applications of usage patterns from Web data. SIGKDD Explorations 1(2): 12–23, 2000.
29. Gumbel, E.J., Discussion on “Rejection of Outliers” by Anscombe, F.J. Technometrics 2: 165–
166, 1960.
30. Redpath, R. and Sheard, J., Domain knowledge to support understanding and treatment of
outliers. In Proceedings of International Conference on Information and Automation (ICIA 2005),
Colombo, Sri Lanka, 2005, pp. 398–403.
31. Giudici, P., Applied Data Mining: Statistical Methods for Business and Industry. John Wiley & Sons
Inc., West Sussex, U.K., 2003.
32. Ribeiro, M.X., Traina, A.J.M., and Caetano Traina, J., A new algorithm for data discretization
and feature selection. In Proceedings of the 2008 ACM Symposium on Applied Computing. ACM,
Fortaleza, Brazil, 2008.
33. Piramuthu, S., Evaluating feature selection methods for learning in data mining applications.
European Journal of Operational Research 156: 483–494, 2004.
34. Catledge, L.D. and Pitkow, J.E., Characterizing browsing strategies in the World-Wide Web.
Computer Networks and ISDN Systems 27(6): 1065–1073, 1995.
35. Cohen, L.B., A two-tiered model for analyzing library website usage statistics, Part 1: Web
server logs. Portal: Libraries and the Academy 3(2): 315–326, 2003.
36. Chan, P.K. Constructing Web user profiles: A non-invasive learning approach to building Web
user profiles. In Proceedings of Workshop on Web Usage Analysis and User Profiling (WEBKDD‘99),
San Diego, CA, 1999, pp. 39–55.
37. Eirinaki, M. and Vazirgiannis, M., Web mining for Web personalization. ACM Transactions on
Internet Technology 3(1): 1–27, 2003.
38. Burns, R., Introduction to Research Methods. 4th edn. Sage Publications Ltd., London, U.K., 2000.
39. Wiersma, W., Research Methods in Education: An Introduction, 6th ed. Allyn & Bacon, Boston, MA,
1995, 480 pp.
40. Kranzler, J.H., Moursund, J., and Kranzler, J., Statistics for the Terrified, 4th ed. Pearson Prentice
Hall, Upper Saddle River, NJ, 2007, 224 pp.
41. Brace, N., Kemp, R., and Snelgar, R., SPSS for Psychologists. 3rd ed. Palgrave McMillan,
Hampshire, U.K., 2006.
42. Coakes, S.J., Steed, L., and Ong, C., SPSS version 16.0 for Windows: Analysis without Anguish.
John Wiley & Sons, Milton, Australia, 2009.
43. Berendt, B. and Brenstein, E., Visualizing individual differences in Web navigation: STRATDYN,
a tool for analysing navigation patterns. Behavior Research Methods, Instruments & Computers
33(2): 243–257, 2001.
44. Sheard, J., Ceddia, J., Hurst, J., and Tuovinen, J., Determining Website usage time from interac-
tions: Data preparation and analysis. Journal of Educational Technology Systems 32(1): 101–121, 2003.
45. Romero, C., Ventura, S., and Garcia, E., Data mining in course management systems: Moodle
case study and tutorial. Computers & Education 51(1): 368–384, 2008.
46. Feng, M. and Heffernan, N., Informing teachers live about student learning: Reporting in the
Assistment system. Technology, Instruction, Cognition and Learning 3: 1–8, 2005.
4
A Data Repository for the EDM
Community: The PSLC DataShop
Contents
4.1 Introduction...........................................................................................................................43
4.2 The Pittsburgh Science of Learning Center DataShop....................................................44
4.3 Logging and Storage Methods............................................................................................ 45
4.4 Importing and Exporting Learning Data.......................................................................... 49
4.5 Analysis and Visualization Tools....................................................................................... 49
4.6 Uses of the PSLC DataShop................................................................................................. 51
4.7 Data Annotation: A Key Upcoming Feature.................................................................... 52
4.8 Conclusions............................................................................................................................ 53
Acknowledgment........................................................................................................................... 53
References........................................................................................................................................ 53
4.1 Introduction
In recent years, educational data mining has emerged as a burgeoning new area for scien-
tific investigation. One reason for the emerging excitement about educational data mining
(EDM) is the increasing availability of fine-grained, extensive, and longitudinal data on
student learning. These data come from many sources, including standardized tests com-
bined with student demographic data (for instance, www.icpsr.umich.edu/IAED), and
videos of classroom interactions [22]. Extensive new data sources have been transforma-
tional in science [5] and business (being a major part of the success of key businesses such
as Google, FedEx, and Wal-Mart).
In this chapter, we present an open data repository of learning data—the Pittsburgh
Science of Learning Center DataShop (http://pslcdatashop.org)—which we have designed to
have characteristics that make it particularly useful for EDM. We discuss the ways in which
members of the EDM community are currently utilizing this resource, and how DataShop’s
tools support both exploratory data analysis and EDM.
At present, DataShop specializes in data on the interaction between students and edu-
cational software, including data from online courses, intelligent tutoring systems, virtual
labs, online assessment systems, collaborative learning environments, and simulations.
Historically, educational data of this nature have been stored in a wide variety of for-
mats, including streamed log files directly from web-based or non-web-based educational
software, summary log files (sometimes including outputs from student models), and
researcher-specific database formats (both flat and relational). Moving toward a common
set of standards for sharing data, student models, and the results of EDM analyses—key
goals of the DataShop project—will facilitate more efficient, extensive storage and use of
such data, and more effective collaboration within the community.
DataShop contains data with three attributes that make it particularly useful for EDM
analyses. First, the data is fine-grained, at the grain size of semantically meaningful “transac-
tions” between the student and the software, including both the student’s action, and the
software’s response. Second, the data is longitudinal, involving student behavior and learning,
in many cases, over the span of an entire semester or year of study. Third, the data is exten-
sive, involving millions of transactions for some of the educational software packages for
which DataShop has data. These three characteristics have made the PSLC DataShop useful
to many educational data miners, both involved with the PSLC and external to it. We have
the ambition of becoming the key venue for sharing educational interaction data and collabo-
rating on its progressive analysis to support scientific discovery in education.
algebra [30], self-explanation in physics [20], the effectiveness of worked examples and
polite language in a stoichiometry tutor [25], and the optimization of knowledge compo-
nent learning in Chinese [27].
• Context message: The student, problem, and session with the tutor
• Tool message: Represents an action in the tool performed by a student or tutor
• Tutor message: Represents a tutor’s response to a student action
Below we see example context, tool, and tutor messages in the DataShop XML format:
FIGURE 4.1
A problem from Carnegie Learning’s Cognitive Tutor Geometry (2005 version).
TABLE 4.2
Data from the “Making-Cans” Example, Aggregated by Student-Step
Columns: #, Student, Problem, Step, Opportunity Count, Total Incorrects, Total Hints,
Assistance Score, Error Rate, Knowledge Component
1 S01 WATERING_VEGGIES (WATERED-AREA Q1) 1 0 0 0 0 Circle-Area
2 S01 WATERING_VEGGIES (TOTAL-GARDEN Q1) 1 2 1 3 1 Rectangle-Area
3 S01 WATERING_VEGGIES (UNWATERED-AREA Q1) 1 0 0 0 0 Compose-Areas
4 S01 WATERING_VEGGIES DONE 1 0 0 0 0 Determine-Done
5 S01 MAKING-CANS (POG-RADIUS Q1) 1 0 0 0 0 Enter-Given
6 S01 MAKING-CANS (SQUARE-BASE Q1) 1 0 0 0 0 Enter-Given
7 S01 MAKING-CANS (SQUARE-AREA Q1) 1 0 0 0 0 Square-Area
8 S01 MAKING-CANS (POG-AREA Q1) 2 0 0 0 0 Circle-Area
9 S01 MAKING-CANS (SCRAP-METAL-AREA Q1) 2 2 0 2 1 Compose-Areas
10 S01 MAKING-CANS (POG-RADIUS Q2) 2 0 0 0 0 Enter-Given
11 S01 MAKING-CANS (SQUARE-BASE Q2) 2 0 0 0 0 Enter-Given
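A rough pandas sketch of this kind of student-step aggregation is shown below. The transaction rows, column names, and the simplified error flag are all illustrative; DataShop computes these quantities internally from its own log format.

```python
import pandas as pd

# A tiny, made-up transaction log (one row per student transaction).
tx = pd.DataFrame({
    "student": ["S01"] * 5,
    "problem": ["MAKING-CANS"] * 5,
    "step":    ["(POG-AREA Q1)", "(SCRAP-METAL-AREA Q1)", "(SCRAP-METAL-AREA Q1)",
                "(SCRAP-METAL-AREA Q1)", "(POG-RADIUS Q2)"],
    "outcome": ["CORRECT", "INCORRECT", "INCORRECT", "CORRECT", "CORRECT"],
})
tx["incorrect"] = (tx["outcome"] == "INCORRECT").astype(int)
tx["hint"] = (tx["outcome"] == "HINT").astype(int)

# Aggregate transactions to student-step rows.
steps = (tx.groupby(["student", "problem", "step"], as_index=False)
           [["incorrect", "hint"]].sum()
           .rename(columns={"incorrect": "total_incorrects", "hint": "total_hints"}))
steps["assistance_score"] = steps["total_incorrects"] + steps["total_hints"]
# Simplified first-attempt error flag (DataShop's actual definition looks
# only at the outcome of the first attempt on the step).
steps["error_rate"] = (steps["assistance_score"] > 0).astype(int)
print(steps)
```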
analysis: A researcher can determine if students are learning by viewing learning curves,
then drill down on individual problems, knowledge components, and students to analyze
performance in greater detail.
The following DataShop tools are available for exploratory data analysis:
FIGURE 4.2
Performance Profiler tool showing the average error rate, which is the incorrect entries (left side of bar) plus
hints (middle of bar) on students’ first attempt at each step across a selection of problems from a Geometry Area
dataset. Using controls not pictured, the user has selected to view the six problems with the lowest error rate
and the six with the highest error rate. The blue points are predictions based on a particular knowledge component
model and the statistical model behind the LFA [14] algorithm.
[Figure 4.3 plot omitted: error rate (%) on the y-axis (0–50) against opportunity (0–10) on the x-axis; plotted series: All data (actual and predicted) and Textbook (predicted).]
FIGURE 4.3
Error rate learning curve with predicted values from a Geometry Area dataset. The solid curve represents the
actual values, each point is an average across all students and knowledge components for the given opportu-
nity. The dashed curve represents the predicted curve values, based on the LFA model [14], a variant of Item
Response Theory.
other metrics as well. The Learning Factors Analysis (LFA) model [14] can provide
predicted values for error rate learning curves. Figure 4.3 depicts error rate learn-
ing curves generated by DataShop. In this graph, “error rate,” or the percentage of
students that asked for a hint or made an incorrect attempt on their first attempt
on steps associated with a specific knowledge component, is shown on the y-axis.
The x-axis (“Opportunity”) indicates the nth time (e.g., 4 is the 4th time) a stu-
dent has (according to the current model) had an opportunity to use a knowledge
component to solve a step in a problem. Each unique step in a problem is dis-
tinct from other problem-solving steps, even if they involve the same knowledge
component(s).
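Given student-step rows with an opportunity count, an error rate curve of this kind can be computed along the following lines; the few rows here are invented.

```python
import pandas as pd

# Made-up student-step rows: opportunity count and first-attempt error flag.
steps = pd.DataFrame({
    "opportunity": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "error_rate":  [1, 1, 0, 1, 0, 0, 0, 0, 0],  # 1 = hint or incorrect first attempt
})

# Average over all students and knowledge components at each opportunity,
# as in an error rate learning curve, expressed as a percentage.
curve = steps.groupby("opportunity")["error_rate"].mean().mul(100)
print(curve)
```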
components or skills (see square-area in Figure 4.3), they under-practiced some harder
skills (see trapezoid-area in Figure 4.3). Based on observation and further analysis, they
created a new version of the geometry tutor by resetting parameters that determine how
often skills are practiced. They ran a classroom experiment where students in a course
were pre- and posttested and randomly assigned to use either the previous or the new
tutor version. Students using the new version took 20% less time to finish the same curric-
ulum units (because over-practice was eliminated) and learned just as much as measured
by normal, transfer, and long-term retention tests.
A second demonstration of a data mining project that “closed the loop” is a work
by Baker et al. [8] who had done formal observations of student behavior in computer
labs while working through lessons of a middle school math Cognitive Tutor. Among a
number of categories of disengaged behavior, he found that “gaming the system” had
the largest correlation with poor learning outcomes. Gaming refers to student behavior
that appears to avoid thinking and learning through systematic guessing or fast and
repeated requests for increasing help. Baker used machine learning techniques to build
a “detector” capable of processing student log information, in real time, to determine
when students were gaming. The detector became the basis for an intervention system,
a “meta tutor,” designed to discourage gaming and engage students in supplementary
instruction on topics they had gamed. A controlled experiment demonstrated student-
learning benefits associated with this adaptive selection of supplementary instruction
for students observed to be gaming. Since then, the gaming detector has been used
within analyses of why students game [11], and precisely how gaming leads to poorer
learning.
Broadly, the data available in DataShop is driving the development of more precise com-
putational models of human cognition, motivation, and learning. In particular, an ongo-
ing area of research using DataShop data is the empirical evaluation and improvement of
knowledge representations [cf. 12,17,32]. As noted in a major national report, “psychomet-
ric validation of [online] assessments is needed so they can be compared with conventional
assessments, and complement and ultimately supplant them” [2].
4.8 Conclusions
We have described PSLC’s DataShop, an open repository and web-based tool suite for stor-
ing and analyzing click-stream data, fine-grained longitudinal data generated by online
courses, assessments, intelligent tutoring systems, virtual labs, simulations, and other
forms of educational technology. In contrast to other types of educational data such as
video and school-level data, data in DataShop include a rich set of semantic codes that
facilitate automated analysis and meaningful interpretation.
The PSLC DataShop uniform data format is an initial attempt to develop a common
standard that we hope will be useful to the field if not as is, then in driving better or more
useful common standards. In addition to being a source for learning data, it is also a place
where researchers can deposit data and then get help from other researchers who can per-
form secondary analysis on this data.
DataShop allows free access to a wide variety of datasets and analysis tools. These tools
help researchers visualize student performance, difficulties, and learning over time. Such
analyses can lead to demonstrably better instructional designs. The data can also drive
improved models of student cognition, affect, and learning that can be used to improve
online assessment and online learning. We take as a premise that the human brain con-
structs knowledge based on a variety of input sources (e.g., verbal, visual, and physical)
and in a fashion and at a grain size that may or may not conform to the structure as
conceived by an instructor or domain expert. The question of how the latent nature and
content of human knowledge representation can be discovered from data is a deep and
important scientific question, like for instance, the nature of the human genome. To answer
this question requires a vast collection of relevant data, associated analysis methods, and
new theory.
Acknowledgment
The research for this chapter was supported by the National Science Foundation award
number SBE-0354420 for the Pittsburgh Science of Learning Center.
References
1. Advanced Distributed Learning. 2003. SCORM Overview. Unpublished white paper. Alexandria,
VA: Advanced Distributed Learning.
2. Ainsworth, S., Honey, M., Johnson, W.L., Koedinger, K.R., Muramatsu, B., Pea, R., Recker, M.,
and Weimar, S. 2005. Cyberinfrastructure for Education and Learning for the Future (CELF): A Vision
and Research Agenda. Washington, DC: Computing Research Association.
3. Aleven, V., McLaren, B., Sewall, J., and Koedinger, K. 2006. The Cognitive Tutor Authoring
Tools (CTAT): Preliminary evaluation of efficiency gains. In: Proceedings of the Eighth International
Conference on Intelligent Tutoring Systems, Jhongli, Taiwan, pp. 61–70.
4. Anderson, J.R., Corbett, A.T., Koedinger, K.R., and Pelletier, R. 1995. Cognitive tutors: Lessons
learned. The Journal of the Learning Sciences 4 (2): 167–207.
5. Atkins, D.E. (ed.). 2003. Revolutionizing Science and Engineering through Cyberinfrastructure: Report
on the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. Arlington,
VA: National Science Foundation. http://www.cise.nsf.gov/sci/reports/atkins.pdf
6. Baker, R.S.J.d., Corbett, A.T., and Aleven, V. 2008. Improving contextual models of guessing
and slipping with a truncated training set. In: Proceedings of the First International Conference on
Educational Data Mining, Montreal, Canada, pp. 67–76.
7. Baker, R.S.J.d., Corbett, A.T., and Aleven, V. 2008. More accurate student modeling through
contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In:
Proceedings of the Ninth International Conference on Intelligent Tutoring Systems, Montreal, Canada,
pp. 406–415.
8. Baker, R., Corbett, A., Koedinger, K.R., Evenson, S., Roll, I., Wagner, A., Naim, M., Raspat,
J., Baker, D., and Beck, J. 2006. Adapting to when students game an intelligent tutoring system.
In M. Ikeda, K. D. Ashley, and T.-W. Chan (Eds.), Proceedings of the Eighth International Conference
on Intelligent Tutoring Systems, Jhongli, Taiwan, pp. 392–401.
9. Baker, R.S.J.d. and de Carvalho, A.M.J.A. 2008. Labeling student behavior faster and more pre-
cisely with text replays. In: Proceedings of the First International Conference on Educational Data
Mining, Montreal, Canada, pp. 38–47.
10. Baker, R.S.J.d., de Carvalho, A.M.J.A., Raspat, J., Aleven, V., Corbett, A.T., and Koedinger, K.R.
2009. Educational Software Features that Encourage and Discourage “Gaming the System”.
In: Proceedings of the 14th International Conference on Artificial Intelligence in Education, Brighton,
U.K., pp. 475–482.
11. Baker, R., Walonoski, J., Heffernan, N., Roll, I., Corbett, A., and Koedinger, K. 2008. Why stu-
dents engage in “Gaming the System” behavior in interactive learning environments. Journal of
Interactive Learning Research 19 (2): 185–224.
12. Barnes, T., Bitzer, D., and Vouk, M. 2005. Experimental analysis of the q-matrix method in
knowledge discovery. In: Proceedings of the 15th International Symposium on Methodologies for
Intelligent Systems, May 25–28, 2005, Saratoga Springs, NY.
13. Brown, J., Frishkoff, G., and Eskenazi, M. 2005. Automatic question generation for vocabu-
lary assessment. In: Proceedings of the Annual Human Language Technology Meeting, Vancouver,
Canada, 249–254.
14. Cen, H., Koedinger, K., and Junker, B. 2006. Learning Factors Analysis—A general method for
cognitive model evaluation and improvement. Proceedings of the Eighth International Conference
on Intelligent Tutoring Systems, Jhongli, Taiwan.
15. Cen, H., Koedinger, K., and Junker, B. 2007. Is over practice necessary?—Improving learn-
ing efficiency with the cognitive tutor through educational data mining. In R. Luckin and
K. Koedinger (eds.), Proceedings of the 13th International Conference on Artificial Intelligence in
Education, Los Angeles, CA, pp. 511–518.
16. Duval, E. and Hodgins, W. 2003. A LOM research agenda. In: Proceedings of the WWW2003—
Twelfth International World Wide Web Conference, 20–24 May 2003, Budapest, Hungary.
17. Falmagne, J.-C., Koppen, M., Villano, M., and Doignon, J.-P. 1990. Introduction to knowledge
spaces: How to build, test, and search them. Psychological Review 97: 201–224.
18. Feng, M. and Heffernan, N.T. 2007. Towards live informing and automatic analyzing of student
learning: Reporting in ASSISTment system. Journal of Interactive Learning Research 18 (2): 207–230.
19. Gertner, A.S. and VanLehn, K. 2000. Andes: A coached problem-solving environment for phys-
ics. In: Proceedings of the Fifth International Conference on Intelligent Tutoring Systems, Montreal,
Canada, pp. 133–142.
20. Hausmann, R. and VanLehn, K. 2007. Self-explaining in the classroom: Learning curve evi-
dence. In McNamara and Trafton (eds.), Proceedings of the 29th Annual Cognitive Science Society,
Nashville, TN, pp. 1067–1072.
21. Leszczenski, J. M. and Beck J. E. 2007. What’s in a word? Extending learning factors analysis
to model reading transfer. In: Proceedings of the Educational Data Mining Workshop at the 14th
International Conference on Artificial Intelligence in Education, Los Angeles, CA, pp. 31–39.
22. MacWhinney, B., Bird, S., Cieri, C., and Martell, C. 2004. TalkBank: Building an open uni-
fied multimodal database of communicative interaction. Proceedings of the Fourth International
Conference on Language Resources and Evaluation, Lisbon, Portugal.
23. Matsuda, N., Cohen, W., Sewall, J., Lacerda, G., and Koedinger, K. R. 2007. Evaluating a simu-
lated student using real students data for training and testing. In: C. Conati, K. McCoy, and
G. Paliouras (eds.), Proceedings of the 11th International Conference on User Modeling, UM2007,
Corfu, Greece, pp. 107–116.
24. Matsuda, N., Cohen, W., Sewall, J., Lacerda, G., and Koedinger, K. R. 2007. Predicting stu-
dents’ performance with SimStudent: Learning cognitive skills from observation. In: R. Lukin,
K.R. Koedinger, and J. Greer (eds.), Proceedings of the 13th International Conference on Artificial
Intelligence in Education, Los Angeles, CA, pp. 467–476.
25. McLaren, B. M., Lim, S., Yaron, D., and Koedinger, K. R. 2007. Can a polite intelligent tutoring
system lead to improved learning outside of the lab? In R. Luckin and K.R. Koedinger (eds.),
Proceedings of the 13th International Conference on Artificial Intelligence in Education, Los Angeles,
CA, pp. 433–440.
26. Nwaigwe, A., Koedinger, K.R., VanLehn, K., Hausmann, R., and Weinstein, A. 2007. Exploring
alternative methods for error attribution in learning curves analyses in intelligent tutoring sys-
tems. In R. Luckin and K.R. Koedinger (eds.), Proceedings of the 13th International Conference on
Artificial Intelligence in Education, Los Angeles, CA, pp. 246–253.
27. Pavlik Jr. P. I., Presson, N., and Koedinger, K. R. 2007. Optimizing knowledge component learn-
ing using a dynamic structural model of practice. In R. Lewis and T. Polk (eds.), Proceedings of
the Eighth International Conference of Cognitive Modeling, Ann Arbor, MI.
28. Rafferty, A. N. and Yudelson, M. 2007. Applying learning factors analysis to build stereotypic
student models. In: Proceedings of the 13th International Conference on Artificial Intelligence in
Education, Los Angeles, CA.
29. Raftery, A. 1995. Bayesian model selection in social science research. Sociological Methodology 28:
111–163.
30. Ritter, S. and Koedinger, K. R. 1998. An architecture for plug-in tutor agents. Journal of Artificial
Intelligence in Education 7 (3–4): 315–347.
31. Smythe, C. and Roberts, P. 2000. An overview of the IMS question & test interoperability specifi-
cation. In: Proceedings of the Conference on Computer Aided Assessment (CAA’2000), Leicestershire,
U.K.
32. Tatsuoka, K. 1983. Rule space: An approach for dealing with misconceptions based on item
response theory. Journal of Educational Measurement 20 (4): 345–354.
5
Classifiers for Educational Data Mining
Contents
5.1 Introduction........................................................................................................................... 57
5.2 Background............................................................................................................................ 58
5.2.1 Predicting Academic Success.................................................................................. 58
5.2.2 Predicting the Course Outcomes............................................................................ 59
5.2.3 Succeeding in the Next Task................................................................................... 59
5.2.4 Metacognitive Skills, Habits, and Motivation...................................................... 59
5.2.5 Summary.................................................................................................................... 60
5.3 Main Principles..................................................................................................................... 60
5.3.1 Discriminative or Probabilistic Classifier?............................................................ 60
5.3.2 Classification Accuracy............................................................................................ 61
5.3.3 Overfitting.................................................................................................................. 62
5.3.4 Linear and Nonlinear Class Boundaries............................................................... 63
5.3.5 Data Preprocessing...................................................................................................64
5.4 Classification Approaches................................................................................................... 65
5.4.1 Decision Trees...........................................................................................................65
5.4.2 Bayesian Classifiers.................................................................................................. 66
5.4.3 Neural Networks...................................................................................................... 67
5.4.4 K-Nearest Neighbor Classifiers............................................................................... 68
5.4.5 Support Vector Machines........................................................................................ 69
5.4.6 Linear Regression..................................................................................................... 69
5.4.7 Comparison............................................................................................................... 70
5.5 Conclusions............................................................................................................................ 71
References........................................................................................................................................ 71
5.1 Introduction
The idea of classification is to place an object into one class or category, based on its other
characteristics. In education, teachers and instructors are always classifying their students
on their knowledge, motivation, and behavior. Assessing exam answers is also a classifica-
tion task, where a mark is determined according to certain evaluation criteria.
Automatic classification is an inevitable part of intelligent tutoring systems and adaptive
learning environments. Before the system can select any adaptation action such as select-
ing tasks, learning material, or advice, it should first classify the learner’s current situation.
For this purpose, we need a classifier—a model that predicts the class value from other
explanatory attributes. For example, one can derive the student’s motivation level from his
or her actions within the tutoring system or predict the students who are likely to fail or
drop out from their task scores. Such predictions are equally useful in traditional teach-
ing, but computerized learning systems often serve larger classes and collect more data
for deriving classifiers.
Classifiers can be designed manually, based on experts’ knowledge, but nowadays it
is more common to learn them from real data. The basic idea is the following: First, we
have to choose the classification method, like decision trees, Bayesian networks, or neural
networks. Second, we need a sample of data, where all class values are known. The data
is divided into two parts, a training set and a test set. The training set is given to a learning
algorithm, which derives a classifier. Then the classifier is tested with the test set, where
all class values are hidden. If the classifier classifies most cases in the test set correctly, we
can assume that it will also work accurately on future data. On the other hand, if the clas-
sifier makes too many errors (misclassifications) in the test data, we can assume that it was
a wrong model. A better model can be searched after modifying the data, changing the
settings of the learning algorithm, or by using another classification method.
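A minimal scikit-learn sketch of this train-and-test procedure is shown below; the synthetic data merely stands in for a real table of student attributes and known class values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for course data: explanatory attributes X and a known
# class value y per student (e.g., pass = 1, fail = 0).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = DecisionTreeClassifier(max_depth=3)   # the chosen classification method
clf.fit(X_train, y_train)                   # learn a classifier from the training set

y_pred = clf.predict(X_test)                # classify the held-out test set
print("estimated accuracy on unseen data:", accuracy_score(y_test, y_pred))
```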
Typically the learning task—like any data mining task—is an iterative process, where
one has to try different data manipulations, classification approaches, and algorithm set-
tings before a good classifier is found. However, there exists a vast amount of both practi-
cal and theoretical knowledge that can guide the search process. In this chapter, we try to
summarize and apply this knowledge on the educational context and give good recipes for
ways to succeed in classification.
The rest of the chapter is organized as follows: In Section 5.2, we survey the previ-
ous research where classifiers for educational purposes have been learned from data. In
Section 5.3, we recall the main principles affecting the model accuracy and give several
guidelines for accurate classification. In Section 5.4, we introduce the main approaches
for classification and analyze their suitability to the educational domain. The final conclu-
sions are drawn in Section 5.5.
5.2 Background
We begin with a literature survey on how data-driven classification has been applied in
the educational context. We consider four types of classification problems that have often
occurred in the previous research. For each group of experiments, we describe the classi-
fication problems solved, type and size of data, main classification methods, and achieved
accuracy (expected proportion of correct classifications in the future data).
important were used. In addition to demographic data and course scores, the data often
contained questionnaire data on students’ perceptions, experiences, and financial situation.
All experiments compared several classification methods. Decision trees were the most
common, but also Bayesian networks and neural networks were popular. The achieved
accuracy was on average 79%, which is a good result for such difficult and important tasks.
In the largest data sets (>15,000 rows), 93%–94% accuracy was achieved.
Real log data was used in the first five experiments. The size of data varied (30–950 rows,
on average 160), because some experiments pooled all data on one student’s actions, while
others could use even short sequences of sessions. The attributes concerned navigation
habits, time spent in different activities, number of pages read, number of times a task was
tried, etc. Only a small number of attributes (4–7) was used to learn models.
In [22], a large set of artificial data was simulated. Four attributes were used to
describe the student’s metacognitive skills (self-efficacy, goal orientation, locus of con-
trol, perceived task difficulty). The idea was that later these attributes could be derived
from log data.
The most common classification methods were decision trees, Bayesian networks,
K-nearest neighbor classifiers, and regression-based techniques. Classification accuracy
was reported only in four experiments and varied between 88% and 98%. One explana-
tion for the high accuracy is that the class values were often decided by experts using some
rules and the same attributes as the classifier used.
5.2.5 Summary
These 24 reviewed experiments give a good overview of the typical educational data and
the most popular classification methods used.
In most cases the class attribute concerned a student, and there was just one row of data
per student. In the university level studies, the size of the data was still large, but in the
course level studies, the data sets were small (50–350 rows). Larger data sets were available
for tasks, where each sequence of log data was classified separately.
Typically, the original data contained both categorical and numeric attributes. Often,
the data was discretized before modeling, but sometimes both numeric and categorical
versions of the data were modeled and compared. Purely numeric data occurred when all
attributes were task scores or statistics on log data (frequencies of actions, time spent in
actions). However, the task scores had often just a few values, and the data was discrete.
This is an important feature, because different classification methods suit discrete and
continuous data differently.
The most common classification methods were decision trees (16 experiments),
Bayesian networks (13), neural networks (6), K-nearest neighbor classifiers (6), support vec-
tor machines (SVMs) (3), and different kinds of regression-based techniques (10).
An alternative is a probabilistic classifier, which defines the probability of classes for all
classified rows. Now M(t) = [P(C = c1|t), …, P(C = cl|t)], where P(C = ci|t) is the probability that
t belongs to class ci.
Probabilistic classification contains more information, which can be useful in some appli-
cations. One example is the task where one should predict the student’s performance in
a course, before the course has finished. The data often contains many inconsistent rows,
where all other attribute values are the same, but the class values are different. Therefore,
the class values cannot be determined accurately, and it is more informative for the course
instructors to know how likely the student will pass the course. It can also be pedagogi-
cally wiser to tell the student that she or he has a 48% probability of passing the course than
to state that she or he is going to fail.
Another example occurs in intelligent tutoring systems (or computerized adaptive test-
ing) where one should select the most suitable action (next task) based on the learner’s
current situation. Now each row of data t describes the learner profile. For each situation
(class ci) there can be several recommendable actions bj with probabilities P(B = bj|C = ci),
which tell how useful action bj is in class ci. Now we can easily calculate the total probabil-
ity P(B = bj|t) that action bj is useful, given learner profile t.
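A small numeric sketch of this calculation follows; the class probabilities would come from a probabilistic classifier's output for one learner profile, and all the numbers are invented.

```python
import numpy as np

# P(C = ci | t): class probabilities for one learner profile t,
# e.g., one row of predict_proba() from a probabilistic classifier.
p_class_given_t = np.array([0.48, 0.32, 0.20])     # classes c1, c2, c3

# P(B = bj | C = ci): how useful each action bj is in each class ci
# (rows: actions b1..b3, columns: classes c1..c3; hypothetical values).
p_action_given_class = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.3],
    [0.1, 0.2, 0.6],
])

# Total probability that each action is useful for this learner:
# P(B = bj | t) = sum_i P(B = bj | C = ci) * P(C = ci | t)
p_action_given_t = p_action_given_class @ p_class_given_t
print(p_action_given_t)    # recommend the action with the highest value
```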
If the accuracy in one class is more critical (e.g., all possible failures or dropouts should
be identified), we can often bias the model to minimize the false positive (false negative) rate
at the cost of a larger false negative (positive) rate (see, e.g., [47, Chapter 5.7]).
When r is the training set, the error is called the training error. If r has the same distribu-
tion as the whole population (e.g., all future students in a similar course), the training error
gives a good estimate for the generalization error also. Unfortunately, this is seldom the case
in the educational domain. The training sets are so small that they cannot capture the real
distribution and the resulting classifier is seriously biased. Therefore, we should somehow
estimate the generalization error on unseen data.
A common solution is to reserve a part of the data as a test set. However, if the data set
is already small, it is not advisable to reduce the training set any further. In this case, m-fold cross-validation is a better solution. The idea is that we partition the original data set of size n into m disjoint subsets of size n/m. Then we reserve one subset for validation and learn the model with the other m − 1 subsets. The procedure is repeated m times with different validation sets, and finally we calculate the mean of the classification errors. An extreme case is
leave-one-out cross-validation, where just one row is saved for validation and the model is
learned from the remaining n − 1 rows.
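The following sketch estimates the generalization error of a simple classifier with 10-fold and leave-one-out cross-validation; the data set is synthetic and only stands in for a small educational data set.

```python
# Minimal sketch (synthetic data): m-fold and leave-one-out cross-validation.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))                     # e.g., 60 students, 4 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # e.g., pass/fail

clf = GaussianNB()

# 10-fold CV: train on 9/10 of the rows, validate on the remaining tenth.
acc_cv = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True,
                                             random_state=0))
# Leave-one-out: n folds of size one.
acc_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print("10-fold error: %.2f" % (1 - acc_cv.mean()))
print("leave-one-out error: %.2f" % (1 - acc_loo.mean()))
```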
5.3.3 Overfitting
Overfitting is an important problem related to accuracy. Overfitting means that the model has fitted the training data too closely, so that it captures even the rarest special cases and errors in the data. The resulting model is so specialized that it cannot generalize to future data. For example, a data set that was collected for predicting students' success in a programming course contained one female student who had good IT skills and self-efficacy, and knew the idea of programming beforehand, but still dropped out of the course. Still, we could not assume that all future students with the same characteristics
would drop out. (In fact, all the other female students with good self-efficacy passed the
course.)
Overfitting happens when the model is too complex relative to the data size. The reason
is that complex models have higher representative power, and they can represent all data
peculiarities, including errors. On the other hand, simple models have lower representa-
tive power, but they generalize well to future data. If the model is too simple, it cannot catch the essential patterns in the data, but underfits. This means that the model approximates the true model poorly, or that no true model exists.
Figure 5.1 demonstrates the effects of overfitting and underfitting when the model complexity increases. In this example, we used the previously mentioned data set from a programming course. The attributes were added to the model in the order of their importance (measured by information gain), and a decision tree was learned with the ID3 algorithm [39]. The simplest model used just one attribute (exercise points in applets), while the last model used 23 attributes. In the simplest models, both the training and testing errors were large, because there were not enough attributes to discriminate the classes, and the models underfitted. On the other hand, the most complex models achieved a small training error, because they could fit the training data well. At the same time, the testing error increased, because the models had overfitted.
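The experiment behind Figure 5.1 can be imitated on synthetic data with the sketch below: attributes are ranked by mutual information (a proxy for information gain), a decision tree is trained on growing attribute subsets, and training and testing errors are recorded. Note that scikit-learn implements a CART-style tree rather than the ID3 algorithm used in the chapter, and the data here is generated, not the course data set.

```python
# Minimal sketch (synthetic data, CART-style trees instead of ID3): training
# and testing error as attributes are added in order of importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=120, n_features=23, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

order = np.argsort(mutual_info_classif(X_tr, y_tr, random_state=0))[::-1]
for k in (1, 5, 10, 23):
    cols = order[:k]
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    print("%2d attributes: train error %.2f, test error %.2f"
          % (k, 1 - tree.score(X_tr[:, cols], y_tr),
             1 - tree.score(X_te[:, cols], y_te)))
```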
In the educational domain, overfitting is a critical problem, because there are many
attributes available to construct a complex model, but only a little data to learn it accu-
rately. As a rule of thumb, it is often suggested (e.g., [12,24]) that we should have at least 5–10 rows of data per model parameter. The simpler the model, the fewer parameters are needed. For example, if we have k binary-valued attributes, a naive Bayes classifier contains O(k) parameters, while a general Bayesian classifier has in the worst case O(2^k) parameters. In the first case, it is enough that n > 5k (10k), while in the latter case, we need at least n > 5 · 2^k (10 · 2^k) rows of data. If the attributes are not binary-valued, more data is needed.
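As a back-of-the-envelope illustration of this rule of thumb (an invented example, not from the chapter), take k = 10 binary-valued attributes:

$$\text{naive Bayes: } O(k) = 10 \text{ parameters} \;\Rightarrow\; n > 50\text{–}100 \text{ rows}; \qquad \text{general Bayesian: } O(2^{k}) = 1024 \text{ parameters} \;\Rightarrow\; n > 5120\text{–}10{,}240 \text{ rows}.$$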
FIGURE 5.1
Effects of underfitting and overfitting. The training error decreases but the testing error increases with the number of attributes (error % plotted against the number of attributes).
In practice, there are two things we can do: (1) use simple classification methods requir-
ing fewer model parameters, and (2) reduce the number of attributes and their domains by
feature selection and feature extraction.
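A minimal sketch of point (2) on synthetic data: feature selection keeps the most informative attributes, while feature extraction (here PCA) replaces them with a few derived components. The attribute counts and data are illustrative.

```python
# Minimal sketch (synthetic data): reducing the attribute set by feature
# selection and by feature extraction before classification.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=100, n_features=30, n_informative=4,
                           random_state=0)

# Feature selection: keep the 5 attributes with the highest mutual information.
X_sel = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# Feature extraction: replace the attributes by 5 principal components.
X_pca = PCA(n_components=5).fit_transform(X)

print(X.shape, "->", X_sel.shape, "and", X_pca.shape)
```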
FIGURE 5.2
Linearly separable and inseparable class boundaries.
less sensitive to overfitting. However, some classes can be separated only by a nonlinear
boundary and a nonlinear classifier is needed.
or less heuristic. Overviews of feature extraction and selection techniques can be found in
[16, Chapter 3] and [47, Chapters 7.1 through 7.3].
(see e.g., [44]). In ensemble learning, we can combine several models with different struc-
tures, and even from different modeling paradigms. In practice, these methods can
remarkably improve classification accuracy.
Finally, we recall that learning a globally optimal decision tree is a nondeterministic
polynomial time (NP)-complete problem [23]. That is why all the common decision tree
algorithms employ some heuristics and can produce suboptimal results.
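In practice this means that decision-tree learners grow the tree greedily and rely on explicit size limits to keep the result simple; a minimal sketch with scikit-learn's CART-style learner (not the chapter's ID3) on synthetic data:

```python
# Minimal sketch (synthetic data): greedy decision-tree induction with
# explicit complexity limits as the usual guard against overly large,
# suboptimal trees.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=150, n_features=8, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                              random_state=0).fit(X, y)
print(export_text(tree))          # human-readable view of the learned rules
```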
In Bayesian classification, the class posterior is obtained from the Bayes rule:

$$P(C = c \mid t) = \frac{P(C = c)\, P(t \mid C = c)}{P(t)}$$
In practice, the problem is the large number of probabilities we have to estimate. For
example, if all attributes A1, …, Ak have ν different values and all Ai's are mutually dependent, we have to define O(ν^k) probabilities. This means that we also need a large training
set to estimate the required joint probability accurately.
Another problem that decreases the classification accuracy of Bayesian networks is
the use of Minimum Description Length (MDL) score function for model selection [13].
MDL measures the error in the model over all variables, but it does not necessarily
minimize the error in the class variable. This problem occurs especially when the model
contains several attributes and the accuracy of estimates P(A1, …, Ak) begins to dominate
the score.
The naive Bayes model solves both problems. The model complexity is restricted by a
strong independence assumption: we assume that all attributes A1, …, Ak are conditionally independent given the class attribute C, that is, $P(A_1, \ldots, A_k \mid C) = \prod_{i=1}^{k} P(A_i \mid C)$. This naive
Bayes assumption can be represented as a two-layer Bayesian network (Figure 5.3), with the
class variable C as the root node and all the other variables as leaf nodes. Now we have
to estimate only O(kν) probabilities per class. The use of MDL score function in the model
selection is also avoided, because the model structure is fixed, once we have decided the
explanatory variables Ai.
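A minimal sketch of a naive Bayes classifier on discretized (categorical) data, where only O(kν) conditional probabilities per class are estimated; the attribute codings and toy data are invented, and Laplace smoothing stands in for handling sparse counts.

```python
# Minimal sketch (invented toy data): naive Bayes on categorical attributes.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Three categorical attributes coded as small integers (e.g., exercise
# activity level, questionnaire answer, prior programming experience).
X = np.array([[0, 1, 2],
              [1, 0, 0],
              [2, 2, 1],
              [0, 0, 1],
              [2, 1, 2],
              [1, 2, 0]])
y = np.array([0, 0, 1, 0, 1, 1])            # e.g., 0 = fail, 1 = pass

clf = CategoricalNB(alpha=1.0).fit(X, y)    # alpha: Laplace smoothing
print(clf.predict_proba(np.array([[1, 1, 1]])))
```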
a three-layer network as a default and add layers only for serious reasons. As a stopping criterion (for deciding when the model is ready), a popular strategy is to use a separate test set [34, p. 111].
FFNNs have several attractive features. They can easily learn nonlinear boundaries and, in principle, represent any kind of classifier. If the original variables are not discriminatory, the FFNN transforms them implicitly. In addition, FFNNs are robust to noise and can be
updated with new data.
The main disadvantage is that FFNNs need a lot of data—much more than typical edu-
cational data sets contain. They are very sensitive to overfitting and the problem is even
more critical with small training sets. The data should be numeric and categorical data
must be somehow quantized before it can be used. However, this increases the model
complexity and the results are sensitive to the quantization method used.
The neural network model is a black box and it is difficult for people to understand
the explanations for the outcomes. In addition, neural networks are unstable and achieve
good results only in good hands [12]. Finally, we recall that finding an optimal FFNN is an NP-complete problem [3], and the learning algorithm can get stuck at a local optimum. Moreover, the training can be time consuming, especially if we want to circumvent overfitting.
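A minimal FFNN sketch on synthetic data: categorical attributes are one-hot encoded, numeric ones scaled, a single small hidden layer is used, and early stopping on a held-out validation split guards against overfitting. Column roles and sizes are invented.

```python
# Minimal sketch (synthetic data): a small feed-forward network with
# preprocessing for mixed attributes and early stopping.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 2))                 # e.g., exercise points
X_cat = rng.integers(0, 3, size=(200, 1))         # e.g., study programme code
X = np.hstack([X_num, X_cat])
y = (X_num[:, 0] > 0).astype(int)

pre = ColumnTransformer([("num", StandardScaler(), [0, 1]),
                         ("cat", OneHotEncoder(), [2])])
ffnn = MLPClassifier(hidden_layer_sizes=(8,), early_stopping=True,
                     validation_fraction=0.2, max_iter=1000, random_state=0)
model = make_pipeline(pre, ffnn).fit(X, y)
print("training accuracy:", model.score(X, y))
```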
the classification is quite robust to noise and missing values. In particular, a weighted distance smooths noise in attribute values, and missing values can simply be skipped. Nearest neighbor classifiers have very high representative power, because they can work with any
kind of class boundaries, given sufficient data.
The main disadvantage is the difficulty of selecting the distance function d. Educational data often consists of both numeric and categorical attributes, and the numeric attributes can be on different scales. This means that we not only need a weighted distance function, but also a large
data set to learn the weights accurately. Irrelevant attributes are also common in some
educational data sets (e.g., questionnaire data) and they should be removed first.
The lack of an explicit model can be either an advantage or a disadvantage. If the model
is very complex, it is often easier to approximate it only locally. In addition, there is no need
to update the classifier when new data is added. However, such "lazy" methods are slower in classification than model-based approaches. If the data set is large, we need some
index to find the nearest neighbors efficiently. It is also noteworthy that an explicit model
is useful for human evaluators and designers of the system.
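A minimal nearest-neighbor sketch: scaling puts attributes that live on very different scales on an equal footing (a crude substitute for a learned weighted distance), and distance-weighted voting lets closer neighbors count more. Data and attribute roles are invented.

```python
# Minimal sketch (synthetic data): K-nearest neighbor classification with
# attribute scaling and distance-weighted voting.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 1, size=(100, 1)),      # e.g., grade average
               rng.normal(0, 100, size=(100, 1))])   # e.g., number of log events
y = (X[:, 0] > 0).astype(int)

knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
knn.fit(X, y)
print(knn.predict_proba(np.array([[0.3, 120.0]])))
```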
However, the data should not contain large gaps (empty areas) and the number of outliers
should be small [21, p. 162].
5.4.7 Comparison
Selecting the most appropriate classification method for the given task is a difficult problem
and no general answer can be given. In Table 5.1, we have evaluated the main classification
methods according to eight general criteria, which are often relevant when educational
data is classified.
The first criterion concerns the form of class boundaries. Decision trees, general Bayesian
networks, FFNNs, nearest neighbor classifiers, and SVMs can represent highly nonlinear
boundaries. Naive Bayes model using nominal data can represent only a subset of linear
boundaries, but with numeric data it can represent quite complex nonlinear boundaries.
Linear regression is restricted to linear boundaries, but it tolerates small deviations from linearity. It should be noted that strong representative power is not desirable if we have only a little data and a simpler, linear model would suffice. The reason is that complex, nonlinear models are also more sensitive to overfitting.
The second criterion, accuracy on small data sets, is crucial for the educational domain.
An accurate classifier cannot be learned if there is insufficient data. The sufficient amount
of data depends on the model complexity. In practice, we should favor simple models,
such as naive Bayes classifiers or linear regression. Support vector machines can produce
extremely good results if the model parameters are correctly selected. On the other hand,
decision trees, FFNNs, and nearest neighbor classifiers require much larger data sets to
work accurately. The accuracy of general Bayesian classifiers depends on the complexity
of the structure used.
The third criterion concerns whether the method can handle incomplete data, that is, noise
(errors), outliers (which can be due to noise), and missing values. Educational data is usu-
ally clean, but outliers and missing values occur frequently. Naive and general Bayesian
TABLE 5.1
Comparison of Different Classification Paradigms

                              DT    NB    GB    FFNN  K-nn  SVM   LR
Nonlinear boundaries          +     (+)   +     +     +     +     −
Accuracy on small data sets   −     +     +/−   −     −     +     +
Works with incomplete data    −     +     +     +     +     −     −
Supports mixed variables      +     +     +     −     +     −     −
Natural interpretation        +     +     +     −     (+)   −     +
Efficient reasoning           +     +     +     +     −     +     +
Efficient learning            +/−   +     −     −     +/−   +     +
Efficient updating            −     +     +     +     +     −     +

Sign + means that the method supports the property, − that it does not. The abbreviations are DT, decision tree; NB, naive Bayes classifier; GB, general Bayesian classifier; FFNN, feed-forward neural network; K-nn, K-nearest neighbor classifier; SVM, support vector machine; LR, linear regression.
classifiers, FFNNs, and nearest neighbor models are especially robust to noise in the data.
Bayesian classifiers, nearest neighbor models, and some extensions of decision trees can also handle missing values quite well. However, decision trees are generally very sensi-
tive to small changes such as noise in the data. Linear regression cannot handle missing
attribute values at all and serious outliers can corrupt the whole model. SVMs are also
sensitive to outliers.
The fourth criterion tells whether the method supports mixed variables, that is, both
numeric and categorical. All methods can handle numeric attributes, but categorical attri-
butes are problematic for FFNNs, linear regression, and SVMs.
Natural interpretation is also an important criterion, since all educational models should
be transparent to the learner (e.g., [37]). All the other paradigms except neural networks and SVMs offer more or less understandable models. In particular, decision trees and Bayesian networks have a comprehensible visual representation.
The last criteria concern the computational efficiency of classification, learning, and updat-
ing the model. The most important is efficient classification, because the system should
adapt to the learner’s current situation immediately. For example, if the system offers indi-
vidual exercises for learners, it should detect when easier or more challenging tasks are
desired. The nearest neighbor classifier is the only one that lacks this property. The efficiency
of learning is not so critical, because it is not done in real time. In some methods, the mod-
els can be efficiently updated, given new data. This is an attractive feature because often
we can collect new data when the model is already in use.
5.5 Conclusions
Classification has many applications in both traditional education and modern educational
technology. The best results are achieved when classifiers can be learned from real data,
but in the educational domain the data sets are often too small for accurate learning.
In this chapter, we have discussed the main principles that affect classification accuracy.
The most important concern is to select a sufficiently powerful model, which catches the
dependencies between the class attribute and other attributes, but which is sufficiently
simple to avoid overfitting. Both data preprocessing and the selected classification method
affect this goal. To help the reader, we have analyzed the suitability of different classifica-
tion methods for typical educational data and problems.
References
1. R.S. Baker, A.T. Corbett, and K.R. Koedinger. Detecting student misuse of intelligent tutoring
systems. In Proceedings of the 7th International Conference on Intelligent Tutoring Systems (ITS‘04),
pp. 531–540. Springer Verlag, Berlin, Germany, 2004.
2. K. Barker, T. Trafalis, and T.R. Rhoads. Learning from student data. In Proceedings of the 2004
IEEE Systems and Information Engineering Design Symposium, pp. 79–86. University of Virginia,
Charlottesville, VA, 2004.
3. A. Blum and R.L. Rivest. Training a 3-node neural network is NP-complete. In Proceedings of the
1988 Workshop on Computational Learning Theory (COLT), pp. 9–18. MIT, Cambridge, MA, 1988.
4. V.P. Bresfelean, M. Bresfelean, N. Ghisoiu, and C.-A. Comes. Determining students’ academic
failure profile founded on data mining methods. In Proceedings of the 30th International Conference
on Information Technology Interfaces (ITI 2008), pp. 317–322. Dubrovnik, Croatia, 2008.
5. M. Cocea and S. Weibelzahl. Can log files analysis estimate learners’ level of motivation?
In Proceedings of Lernen–Wissensentdeckung-Adaptivität (LWA2006), pp. 32–35, Hildesheim,
Germany, 2006.
6. M. Cocea and S. Weibelzahl. Cross-system validation of engagement prediction from log files.
In Creating New Learning Experiences on a Global Scale, Proceedings of the 2nd European Conference
on Technology Enhanced Learning (EC-TEL2007), volume 4753, Lecture Notes in Computer Science,
pp. 14–25. Springer, Heidelberg, Germany, 2007.
7. M. Damez, T.H. Dang, C. Marsala, and B. Bouchon-Meunier. Fuzzy decision tree for user mod-
eling from human-computer interactions. In Proceedings of the 5th International Conference on
Human System Learning (ICHSL’05), pp. 287–302. Marrakech, Maroc, 2005.
8. G. Dekker, M. Pechenizkiy, and J. Vleeshouwers. Predicting students drop out: A case study.
In Educational Data Mining 2009: Proceedings of the 2nd International Conference on Educational
Data Mining (EDM’09), pp. 41–50. Cordoba, Spain, July 1–3, 2009.
9. M.C. Desmarais and X. Pu. A Bayesian student model without hidden nodes and its com-
parison with item response theory. International Journal of Artificial Intelligence in Education,
15:291–323, 2005.
10. P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-
one loss. Machine Learning, 29:103–130, 1997.
11. R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience
Publication, New York, 2000.
12. R. Duin. Learned from neural networks. In Proceedings of the 6th Annual Conference of the Advanced
School for Computing and Imaging (ASCI-2000), pp. 9–13. Advanced School for Computing and
Imaging (ASCI), Delft, the Netherlands, 2000.
13. N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning,
29(2–3):131–163, 1997.
14. W. Hämäläinen, T.H. Laine, and E. Sutinen. Data mining in personalizing distance education
courses. In C. Romero and S. Ventura, editors, Data Mining in e-Learning, pp. 157–171. WitPress,
Southampton, U.K., 2006.
15. W. Hämäläinen and M. Vinni. Comparison of machine learning methods for intelligent tutor-
ing systems. In Proceedings of the 8th International Conference on Intelligent Tutoring Systems, vol-
ume 4053, Lecture Notes in Computer Science, pp. 525–534. Springer-Verlag, Berlin, Germany,
2006.
16. J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San
Francisco, CA, 2006.
17. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, Cambridge, MA, 2002.
18. R. Hecht-Nielsen. Theory of the backpropagation neural network. In Proceedings of the
International Joint Conference on Neural Networks (IJCNN), volume 1, pp. 593–605. IEEE,
Washington, DC, 1989.
19. S. Herzog. Estimating student retention and degree-completion time: Decision trees and neural
networks vis-a-vis regression. New Directions for Institutional Research, 131:17–33, 2006.
20. A. Hinneburg, C.C. Aggarwal, and D.A. Kleim. What is the nearest neighbor in high dimen-
sional spaces? In Proceedings of 26th International Conference on Very Large Data Bases (VLDB
2000), pp. 506–515. Morgan Kaufmann, Cairo, Egypt, September 10–14, 2000.
21. P.J. Huber. Robust Statistics. Wiley Series in Probability and Mathematical Statistics. John
Wiley & Sons, New York, 1981.
22. T. Hurley and S. Weibelzahl. Eliciting adaptation knowledge from online tutors to increase
motivation. In Proceedings of 11th International Conference on User Modeling (UM2007), volume
4511, Lecture Notes in Artificial Intelligence, pp. 370–374. Springer Verlag, Berlin, Germany, 2007.
23. L. Hyafil and R.L. Rivest. Constructing optimal binary decision trees is NP-complete. Information
Processing Letters, 5(1):15–17, 1976.
24. A.K. Jain, P.W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
25. I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
26. A. Jonsson, J. Johns, H. Mehranian, I. Arroyo, B. Woolf, A.G. Barto, D. Fisher, and S. Mahadevan.
Evaluating the feasibility of learning student models from data. In Papers from the 2005 AAAI
Workshop on Educational Data Mining, pp. 1–6. AAAI Press, Menlo Park, CA, 2005.
27. C. Jutten and J. Herault. An adaptive algorithm based on neuromimetic architecture. Signal
Processing, 24:1–10, 1991.
28. S.B. Kotsiantis, C.J. Pierrakeas, and P.E. Pintelas. Preventing student dropout in distance
learning using machine learning techniques. In Proceedings of 7th International Conference on
Knowledge-Based Intelligent Information and Engineering Systems (KES-2003), volume 2774, Lecture
Notes in Computer Science, pp. 267–274. Springer-Verlag, Heidelberg, Germany, 2003.
29. M.-G. Lee. Profiling students’ adaption styles in web-based learning. Computers & Education,
36:121–132, 2001.
30. C.X. Ling and H. Zhang. The representational power of discrete Bayesian networks. The Journal
of Machine Learning Research, 3:709–721, 2003.
31. C.-C. Liu. Knowledge discovery from web portfolios: Tools for learning performance assess-
ment. PhD thesis, Department of Computer Science Information Engineering Yuan Ze
University, Taiwan, 2000.
32. Y. Ma, B. Liu, C.K. Wong, P.S. Yu, and S.M. Lee. Targeting the right students using data mining.
In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD’00), pp. 457–464. ACM Press, New York, 2000.
33. B. Minaei-Bidgoli, D.A. Kashy, G. Kortemeyer, and W. Punch. Predicting student performance:
An application of data mining methods with an educational web-based system. In Proceedings
of 33rd Frontiers in Education Conference, pp. T2A-13–T2A-18. Westminster, CO, November 5–8,
2003.
34. T.M. Mitchell. Machine Learning. McGraw-Hill Companies, New York, 1997.
35. M. Mühlenbrock. Automatic action analysis in an interactive learning environment. In
Proceedings of the Workshop on Usage Analysis in Learning Systems at AIED-2005, pp. 73–80.
Amsterdam, the Netherlands, 2005.
36. N. Thai Nghe, P. Janecek, and P. Haddawy. A comparative analysis of techniques for predicting
academic performance. In Proceedings of the 37th Conference on ASEE/IEEE Frontiers in Education,
pp. T2G-7–T2G-12. Milwaukee, WI, October 10–13, 2007.
37. T. O’Shea, R. Bornat, B. Boulay, and M. Eisenstad. Tools for creating intelligent computer tutors.
In Proceedings of the International NATO Symposium on Artificial and Human Intelligence, pp. 181–
199. Elsevier North-Holland, Inc., New York, 1984.
38. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan
Kaufman Publishers, San Mateo, CA, 1988.
39. J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
40. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
41. C. Romero, S. Ventura, P.G. Espejo, and C. Hervas. Data mining algorithms to classify students.
In Educational Data Mining 2008: Proceedings of the 1st International Conference on Educational Data
Mining, pp. 8–17. Montreal, Canada, June 20–21, 2008.
42. S.J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, 2nd edn. Prentice Hall,
Englewood Cliffs, NJ, 2002.
43. J.F. Superby, J.-P. Vandamme, and N. Meskens. Determination of factors influencing the
achievement of the first-year university students using data mining methods. In Proceedings of
the Workshop on Educational Data Mining at ITS’06, pp. 37–44. Jhongali, Taiwan, 2006.
44. G. Valentini and F. Masulli. Ensembles of Learning Machines, volume 2486, Lecture Notes in
Computer Science, pp. 3–22. Springer-Verlag, Berlin, Germany, 2002. Invited Review.
45. V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
46. J. Vomlel. Bayesian networks in educational testing. International Journal of Uncertainty, Fuzziness
and Knowledge Based Systems, (Supplementary Issue 1):83–100, 2004.
47. I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed.
Morgan Kaufmann, San Francisco, CA, 2005.
48. Xycoon. Linear regression techniques. Chapter II. In Statistics – Econometrics – Forecasting (Online
Econometrics Textbook), Office for Research Development and Education, 2000–2006. Available
on http://www.xycoon.com/. Retrieved on January 1, 2006.
49. W. Zang and F. Lin. Investigation of web-based teaching and learning by boosting algorithms.
In Proceedings of IEEE International Conference on Information Technology: Research and Education
(ITRE 2003), pp. 445–449. Malaga, Spain, August 11–13, 2003.
6
Clustering Educational Data
Contents
6.1 Introduction
6.2 The Clustering Problem in Data Mining
  6.2.1 k-Means Clustering
  6.2.2 Fuzzy c-Means Clustering
  6.2.3 Kohonen Self-Organizing Maps
  6.2.4 Generative Topographic Mapping
6.3 Clustering in e-Learning
  6.3.1 Cluster Analysis of e-Learning Material
  6.3.2 Clustering of Students according to Their e-Learning Behavior
  6.3.3 Clustering Analysis as a Tool to Improve e-Learning Environments
6.4 Conclusions
Acknowledgment
References
6.1 Introduction
The Internet and the advance of telecommunication technologies allow us to share and
manipulate information in nearly real time. This reality is determining the next generation
of distance education tools. Distance education arose from traditional education in order
to cover the necessities of remote students and/or help the teaching–learning process, rein-
forcing or replacing traditional education. The Internet takes this process of delocaliza-
tion of the educative experience to a new realm, where the lack of face-to-face interaction
is, at least partially, replaced by an increased level of technology-mediated interaction.
Furthermore, telecommunications allow this interaction to take forms that were not available to teachers and learners in traditional face-to-face and distance education.
Most e-learning processes generate vast amounts of data that could well be used to bet-
ter understand the learners and their needs, as well as to improve e-learning systems. Data
mining was conceived to tackle this type of problem. As a field of research, it is almost con-
temporary to e-learning, and somehow vaguely defined. Not because of its complexity, but
because it places its roots in the ever-shifting world of business studies. In its most formal
definition, it can be understood not just as a collection of data analysis methods, but as a
data analysis process that encompasses anything from data understanding, preprocessing,
and modeling to process evaluation and implementation [1]. It is nevertheless usual to pay
preferential attention to the data mining methods themselves. These commonly bridge
the fields of traditional statistics, pattern recognition, and machine learning to provide
analytical solutions to problems in areas as diverse as biomedicine, engineering, and busi-
ness, to name just a few. An aspect that perhaps makes data mining unique is that it pays
special attention to the compatibility of the data modeling techniques with new informa-
tion technologies (IT) and database development and operation, often focusing on large,
heterogeneous, and complex databases. e-Learning databases often fit this description.
Therefore, data mining can be used to extract knowledge from e-learning systems
through the analysis of the information available in the form of data generated by their
users. In this case, the main objective becomes finding the patterns of teachers’ and stu-
dents’ system usage and, perhaps most importantly, discovering the students’ learning
behavior patterns. For a detailed insight into data mining applied to e-learning systems,
the reader is referred to [2].
e-Learning has become a social process connecting students to communities of devices,
people, and situations, so that they can construct relevant and meaningful learning expe-
riences that they co-author themselves. Collaborative learning is one of the components
of pervasive learning, the others being autonomy, location, and relationship. There are a
variety of approaches in education that involve joint intellectual effort by students of simi-
lar performance. Groups of students work together to understand concepts, share ideas,
and ultimately succeed as a whole. Students of similar performance can be identified and
grouped by the assessment of their abilities.
The process of grouping students is, therefore, one of relevance for data mining. This
naturally refers to an area of data analysis, namely data clustering, which is aimed at
discovering the natural grouping structure of data. This chapter is devoted to this area
of data mining and its corresponding analytical methods, which hold the promise of
providing useful knowledge to the community of e-learning practitioners. Clustering
and visualization methods can enhance the e-learning experience, due to the capacity of
the former to group similar actors based on their similarities, and the ability of the latter
to describe and explore these groups intuitively.
The remainder of the chapter is organized as follows: In Section 6.2, we provide a brief
account of the clustering problem and some of the most popular clustering techniques.
Section 6.3 provides a broad review of clustering in e-learning. Finally, Section 6.4 wraps
up the chapter by drawing some general conclusions.
falls short in the case of some fuzzy or probabilistic clustering methods in which, instead
of hard assignments, the result is a measure or a probability of cluster membership. This
assignment must be the result of applying similarity measures such as, in the most standard case, a Euclidean distance between points. We cannot ignore that similarity is often a tricky concept in clustering, given that, beyond data density, we must often also consider cluster shape and size.
In terms of processes, we could broadly separate hierarchical and partitional methods.
Hierarchical clustering assumes that cluster structure appears at different, usually nested
structural levels. Partitional clustering considers a single common level for all clusters and
it can therefore be considered as a special case of hierarchical clustering. Among parti-
tional methods, k-means is perhaps the most popular one [3]. Plenty of extensions of this
method have been developed over the last decades, including fuzzy versions that avoid
hard cluster assignments, such as Fuzzy c-means [4] or even adaptations to hierarchical
processing such as Bisecting k-means [5].
Another way of discriminating between different clustering techniques is according to
the objective function they usually aim to optimize. Also, and this may perhaps be a neater
criterion, clustering methods can be either stochastic in nature (probabilistic techniques)
or heuristic (mostly algorithmic). The former resort to the definition of probability den-
sity distributions and, among their advantages, most of their adaptive parameters can be
estimated in an automated and principled manner. Mixture models are a basic example of
these [6] and they often entail Bayesian methods, e.g., [7]. Heuristic methods are plentiful,
very different in terms of origin (theoretically and application-wise) and, therefore, far
too varied for a basic review such as this. One of the most successful ones [8] is Kohonen’s
self-organizing map (SOM [9]), in its many forms. The roots of this model are in the field
of neuroscience, but it has become extremely successful as a tool for simultaneous data
clustering and visualization. A probabilistic alternative to SOM, namely generative topo-
graphic mapping (GTM [10]) was defined to preserve SOM’s functionality while avoiding
many of its limitations.
The evaluation of clustering processes is not straightforward. In classification or pre-
diction, evaluation is usually performed on the basis of the available class label informa-
tion in test sets. This is not the case in clustering. The validity of a clustering solution often depends on the domain expert's point of view. Quantifying this can be difficult, since the interpretation of how interesting a clustering is will inevitably be an application-dependent matter and therefore subjective to some degree.
However, there are many issues that are relevant for the choice of clustering technique.
One of them is data heterogeneity. Different data require different methods and, often,
data come in different modalities simultaneously. Some examples include the analysis of
data streams [11], and rank data [12]. Data sample size is also important, as not all meth-
ods are suited to deal with large data collections (see, for instance, [13]). The methods capable of handling very large data sets can be grouped into different categories according to [14]: efficient nearest-neighbor (NN) search, data summarization, distributed computing, incremental clustering, and sampling-based methods. Sometimes, we have to analyze
structured data, for which there are semantic relationships within each object that must
be considered, or data for which graphs are a natural representation. Graphical clustering
methods can be of use in these cases [15,16].
Next, we single out some popular clustering techniques and describe them in further
detail. They are k-means, fuzzy c-means (FCM), SOM, and GTM. They represent rather
different methodological approaches to the common target problem of multivariate data
grouping.
FIGURE 6.1
k-Means results to find three clusters in a simple data set (two panels, (a) and (b)).
procedures. The most popular metric in clustering analysis is the Euclidean distance. It
works well when a data set has intuitive compact and/or isolated clusters. k-Means typi-
cally resorts to the Euclidean metric for computing the distance between points and clus-
ter centers. Therefore, k-means tends to find hyperspherical or ball-shaped clusters in
data. Alternatively, k-means with the Mahalanobis distance metric has been used to detect
hyperellipsoidal clusters [17].
As stated above, the basic k-means requires the number of clusters, i.e., the K value, as an initial parameter; this is a critical decision. Moreover, different initializations can lead to different clustering solutions, as the algorithm has no guarantee of converging to a global minimum. The final clustering solution is, therefore, rather sensitive to the initialization of the cluster centers.
To deal with the original k-means drawbacks, many variants or extensions have been
developed. Some of these extensions deal with both the selection of a good initial parti-
tion and with allowing splitting and merging of clusters, all with the ultimate goal of finding the global minimum. Two well-known variants of k-means are ISODATA [18] and
FORGY [19]. In k-means, each data point is assigned to a single cluster (hard assignment).
Fuzzy c-means, proposed by Dunn [4], is an extension of k-means where each data point
can be a member of multiple clusters with a membership value (soft assignment). Data
reduction by replacing group examples with their centroids before clustering them was
used to speed up k-means and fuzzy c-means in [20]. Other variants produce a hierarchical
clustering by applying the algorithm with Kâ•›=â•›2 to the overall data set and then repeating
the procedure recursively within each cluster [21].
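A minimal k-means sketch on synthetic two-dimensional data: several random restarts (n_init) are used to mitigate the sensitivity to the initial cluster centers discussed above, and the within-cluster sum of squares (inertia) is the criterion being minimized.

```python
# Minimal sketch (synthetic data): k-means with several random restarts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centers:\n", km.cluster_centers_)
print("within-cluster sum of squares:", round(km.inertia_, 2))
```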
In fuzzy clustering, each point x is assigned a degree of membership u_k(x) in each of the clusters, and these memberships sum to one:

$$\forall x: \sum_{k=1}^{\text{Num clusters}} u_k(x) = 1 \qquad (6.1)$$
FIGURE 6.2
Fuzzy and hard clusters representation (fuzzy clusters F1 and F2 versus hard clusters H1 and H2 over nine numbered data points).
With FCM, the centroid of a cluster is the mean of all points, weighted by their degree of
belonging to the cluster, as shown in (6.2):
$$\mathrm{center}_k = \frac{\sum_x u_k(x)^m\, x}{\sum_x u_k(x)^m} \qquad (6.2)$$
The degree of belonging is related to the inverse of the distance to the cluster center, as it
is presented in (6.3):
$$u_k(x) = \frac{1}{d(\mathrm{center}_k, x)} \qquad (6.3)$$
The coefficients are normalized and fuzzified with a real parameter m > 1 so that their sum is 1:

$$u_k(x) = \frac{1}{\sum_j \big( d(\mathrm{center}_k, x) / d(\mathrm{center}_j, x) \big)^{2/(m-1)}} \qquad (6.4)$$
When m is close to 1, the cluster center closest to a given data point receives much more weight than the rest, and the algorithm behaves similarly to k-means.
The FCM algorithm can be summarized in the following steps [22]:
1. Select an initial fuzzy partition of the N objects into K clusters by choosing the N × K membership matrix U. An element uij of this matrix represents the grade of membership of object xi in cluster Cj, where uij takes values between 0 and 1.
2. Reassign data samples to clusters so as to reduce a criterion function value, and recompute U. To do this, use U to compute the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is shown in (6.5).
$$E^2(X, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik}^2\, \lVert x_i - c_k \rVert^2 \qquad (6.5)$$
FCM can still converge to local minima of the squared error criterion. The design of mem-
bership functions is the most important problem in fuzzy clustering; different choices
include those based on similarity decomposition and centroids of clusters [22].
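A minimal NumPy sketch of the FCM iteration, alternating the centroid update (6.2) and the membership update (6.4); the stopping rule (a fixed number of iterations) and the small constant that avoids division by zero are simplifications for illustration, and the data is synthetic.

```python
# Minimal sketch (synthetic data): fuzzy c-means via the updates (6.2) and (6.4).
import numpy as np

def fuzzy_c_means(X, K, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), K))
    U /= U.sum(axis=1, keepdims=True)               # memberships sum to 1, eq. (6.1)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]              # eq. (6.2)
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
        U = 1.0 / d ** (2.0 / (m - 1.0))            # eq. (6.4) before normalization
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (40, 2)), rng.normal(3, 0.4, (40, 2))])
centers, U = fuzzy_c_means(X, K=2)
print(centers)
```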
FIGURE 6.3
Simple SOM model (a layer of neurons connected to the input pattern x1, x2).
cluster is based on a delta function, i.e., on a winner-takes-all in which only the winning
neuron is updated to become closer to the observed data point in each iteration.
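A minimal sketch of the winner-takes-all update described above, on synthetic data: for each observed point only the closest prototype (the winning neuron) is moved toward it. A full SOM would also update the winner's neighbors on the map, with the neighborhood shrinking over time.

```python
# Minimal sketch (synthetic data): online winner-takes-all prototype update.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
prototypes = rng.normal(1, 1, size=(4, 2))       # 4 neurons, 2 input dimensions
lr = 0.1                                         # learning rate

for epoch in range(20):
    for x in rng.permutation(X):                 # present points in random order
        winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
        prototypes[winner] += lr * (x - prototypes[winner])   # move winner only
print(prototypes)
```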
where
Φ are basis functions Φ(u) = (ϕ1(u), …, ϕM(u)) that introduce the nonlinearity in the mapping
W is the matrix of adaptive weights wmd that defines the mapping
u is a point in latent space
$$p(x \mid W, \beta) = \frac{1}{K} \sum_{k=1}^{K} \left( \frac{\beta}{2\pi} \right)^{D/2} \exp\left( -\frac{\beta}{2} \lVert y_k - x \rVert^2 \right) \qquad (6.7)$$
where the D elements of y are given by (6.6). This density allows for the definition of a
model likelihood, and the well-known Expectation-Maximization (EM [24]) algorithm can
be used to obtain the maximum likelihood estimates of the adaptive parameters (W and β)
of the model. See [10] for details on these calculations.
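For illustration, the density (6.7) can be evaluated directly once the reference vectors y_k and β are available; the prototypes and β below are invented numbers, not fitted values.

```python
# Minimal sketch (invented prototypes): evaluating the GTM mixture density (6.7).
import numpy as np

def gtm_density(x, Y, beta):
    """p(x|W,beta) = (1/K) * sum_k (beta/2pi)^(D/2) * exp(-beta/2 * ||y_k - x||^2)."""
    K, D = Y.shape
    sq_dist = np.sum((Y - x) ** 2, axis=1)
    return np.mean((beta / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * sq_dist))

Y = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])   # K = 3 reference vectors y_k
print(gtm_density(np.array([1.0, 0.4]), Y, beta=4.0))
```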
Model interpretation usually requires a drastic reduction in the dimensionality of the
data. Latent variable models can provide such interpretation through visualization, as
they describe the data in intrinsically low-dimensional latent spaces. Each of the latent
points uk in the latent visualization space is mapped, following (6.6), as yk = Φ(uk)W. The yk
points are usually known as reference vectors or prototypes. Each of the reference vector
elements corresponds to one of the input variables, and their values over the latent visu-
alization space can be color-coded to produce reference maps that provide information on
the behavior of each variable and its influence on the clustering results. Each of the latent
space points can be considered by itself as a cluster representative.
The probability theory foundations of GTM allow the definition of principled alternatives
for the automatic detection of outliers [25] and for unsupervised feature relevance deter-
mination and feature selection [26].
TABLE 6.1
Research in Clustering Analysis of e-Learning Material
(Each entry gives the reference, the clustering method, and the e-learning objective.)

[20] Bisection k-means: To find and organize distributed courseware resources.
[18] SOM: To cluster similar learning material into classes, based on semantic similarities.
[19] Hierarchical agglomerative clustering, single-pass clustering, and k-NN: To group similar learning documents based on their topics and similarities. A DIG for document representation is introduced.
[21] Fuzzy ART: To generate test sheets. Fuzzy logic theory is used to determine the difficulty levels of test items, and fuzzy ART to cluster the test items into groups. Dynamic programming is used for the test sheet construction.
[24] Conceptual clustering algorithm (DISC): Reorganization of learning content by means of a computerized adaptive test. The concept of knowledge unit extracted from each course topic is introduced.
[26] SOM: The clustering of the course materials using a measure of the similarity between the terms they contain.
[27] FCM: The organization of online learning material.
[28,29] Agglomerative, direct k-way, repeated bisection, and graph partitional clustering from the CLUTO package: To cluster e-learning documents into meaningful groups to find out more refined subconcepts. This information is used for improving the content search in e-learning platforms.
[30] Hierarchical agglomerative clustering: To construct an e-learning FAQ concept hierarchy. Rough set theory is used to classify users' queries.
[31] Web text clustering based on maximal frequent itemsets: To group e-learning documents based on their frequent word similarities.
In order to improve the content and organization of the resources of virtual courses,
clustering methods concerned with the evaluation of learning materials, such as those
presented here, could be used to the advantage of all those involved in the learning pro-
cess. It is our opinion that, beyond the application of clustering techniques, a sensible
improvement in the development of course material evaluation processes may come from
the exploration of groups according to their web usage patterns. Subsequently, association
rules, for instance, could be applied to explore the relationships between the usability of
the course materials and the students’ learning performance, on the basis of the informa-
tion gathered from the interaction between the user and the learning environment.
Moreover, if we can perform students’ evaluation from their system usability behavior,
this outcome could also indirectly be used to improve the course resources. For instance,
if the students' evaluation was unsatisfactory, it could hint that the course resources and learning materials are inadequate and should therefore be changed and/or improved based on the successful students' navigational paths.
TABLE 6.2
Research in Clustering Analysis to Group e-Learning Students
(Each entry gives the reference, the clustering method, and the e-learning objective.)

[32,33] SOM: Students' evaluation in a tutorial supervisor system.
[34] Weighted Euclidean distance-based clustering: To group students according to the purpose of the recommendation in collaborative environments.
[35] Weighted Euclidean distance-based clustering: To promote group-based collaborative learning and to provide incremental student diagnosis.
[36] SOM: To group students based on their background. After that, association rule mining (a priori) is performed to provide course recommendations.
[37–39] GTM, t-GTM, and FRD-GTM: The clustering and visualization of multivariate data concerning the behavior of the students of a virtual course, including atypical learning behavior analysis and feature relevance ranking.
[40] EM: To group students into clusters according to their browsing behaviors.
[41] EM: To discover user behavior patterns in collaborative activities.
[42] FCM: To group students based on students' characteristics. Although the proposed tool is able to work with n-dimensional spaces, at the moment of publication it is able to use a maximum of three attributes.
[43] FCM: To group students based on their personality and learning strategy.
[44] Fuzzy set clustering: To group similar students into homogeneous classes and to provide personalized learning to students.
[45] Matrix-based clustering method: To group students based on personal attributes.
[46] Hierarchical Euclidean distance-based clustering: To group students' levels in ICT based on the answers provided from e-questionnaires.
[47] Weighted Euclidean distance-based clustering: To provide students' grouping according to their navigational behavior. In this study, the authors generated and used simulated navigational data.
[48] k-Means: To cluster students based on performance similarities.
[49] k-Means: To improve webpage access prediction performance, combined with association rule mining and Markov models.
[50] Two-phase hierarchical clustering: To group students based on their learning styles and usability preferences.
[51] Naïve Bayes: To form e-learning interactive students' groups.
of atypical (outlier) students. On the other hand, they were interested in estimating the
relative relevance of the available data features [46]. The results showed that useful knowl-
edge can be extracted from the t-GTM combination of outlier detection, feature relevance
determination, and data clustering and visualization [48]. This knowledge could be fed
back into the e-learning system in order to provide students with personalized guidance,
tailored to their inhomogeneous needs and requirements.
In [49], user actions associated to students’ web usage were gathered and preprocessed
as part of a data mining process. The EM algorithm was then used to group the users into
clusters according to their behaviors. These results could be used by teachers to provide
specialized advice to students belonging to each cluster. The EM algorithm was also the
method of choice in [50], where clustering was used to discover user behavior patterns in
collaborative activities in e-learning applications.
time, in a more effective way than if we tried to do it individually. This option, using clus-
tering methods, could reduce teachers’ workload.
6.4 Conclusions
The pervasiveness of the Internet has enabled online distance education to become far
more mainstream than it used to be, and that has happened in a surprisingly short time.
e-Learning course offerings are now plentiful, and many new e-learning platforms and
systems have been developed and implemented with varying degrees of success. These
systems generate an exponentially increasing amount of data, and much of this informa-
tion has the potential to become new and usable knowledge to improve all instances of
e-learning. Data mining clustering processes should enable the extraction of this knowl-
edge and facilitate its use.
It is still early days for the integration of data mining in e-learning systems and not
many real and fully operative implementations are available. Nevertheless, a good deal
of academic research in this area has been published over the last few years. Much of
it concerns the design and application of clustering methods. In this chapter, we have
reviewed, in some detail, recent research on clustering as applied to e-learning, dealing
with problems of students’ learning assessment, learning materials and course evaluation,
and course adaptation based on students’ learning behavior. It has been argued that the
use of clustering strategies should ease the reutilization of the knowledge generated by
data mining processes, as well as reduce the costs of educative personalization processes.
Acknowledgment
Félix Castro is a research fellow within the PROMEP program of the Mexican Secretary of
Public Education.
References
1. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R.,
CRISP-DM 1.0 step by step data mining guide. CRISP-DM Consortium, August 2000.
2. Castro, F., Vellido, A., Nebot, A., and Mugica, F., Applying data mining techniques to e-learning
problems, in Evolution of Teaching and Learning Paradigms in Intelligent Environment, Studies in
Computational Intelligence (SCI), Vol. 62, Jain, L.C., Tedman, R.A. and Tedman, D.K. (Eds.),
pp. 183–221, 2007.
3. MacQueen, J., Some methods for classification and analysis of multivariate observations, in
Fifth Berkeley Symposium on Mathematics, Statistics and Probability, University of California Press,
Berkeley, CA, pp. 281–297, 1967.
4. Dunn, J.C., A fuzzy relative of the ISODATA process and its use in detecting compact well-
separated clusters. J. Cybern. 3, 32–57, 1973.
5. Pelleg, D. and Moore, A., Accelerating exact k-means algorithms with geometric reasoning, in
The Fifth International Conference on Knowledge Discovery in Databases, AAAI Press, Menlo Park,
CA, pp. 277–281, 1999.
6. McLachlan, G.J. and Basford, K.E., Mixture Models: Inference and Applications to Clustering.
Marcel Dekker, New York, 1988.
7. Blei, D.M., Ng, A.Y., and Jordan, M.I., Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022,
2003.
8. Oja, M., Kaski, S., and Kohonen, T., Bibliography of self-organizing map (SOM). Neural Comput.
Surveys 3, 1–156, 2002.
9. Kohonen, T., Self-Organizing Maps, 3rd edn., Springer-Verlag, Berlin, Germany, 2001.
10. Bishop, C.M., Svensén, M., and Williams, C.K.I., GTM: The generative topographic mapping.
Neural Comput. 10(1), 215–234, 1998.
11. Hore, P., Hall, L.O., and Goldgof, D.B., A scalable framework for cluster ensembles. Pattern
Recogn. 42(5), 676–688, 2009.
12. Busse, L.M., Orbanz, P., and Buhmann, J.M., Cluster analysis of heterogeneous rank data, in
Proceedings of the 24th International Conference on Machine Learning (ICML), ACM, New York,
pp. 113–120, 2007.
13. Andrews, N.O. and Fox, E.A., Recent developments in document clustering. Technical Report,
TR-07-35. Department of Computer Science, Virginia Tech, Blacksburg, VA, 2007.
14. Jain, A.K., Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31(8), 651–666,
2010.
15. Tsuda, K. and Kudo, T., Clustering graphs by weighted substructure mining, in Proceedings
of the 23rd International Conference on Machine Learning (ICML), ACM, New York, pp. 953–960,
2006.
16. Backstrom, L., Huttenlocher, D., Kleinberg, J., and Lan, X., Group formation in large social net-
works: Membership, growth, and evolution, in Proceedings of the 12th International Conference on
Knowledge Discovery and Data Mining, Philadelphia, PA, 2006.
17. Mao, J. and Jain, A.K., A self-organizing network for hyperellipsoidal clustering (HEC), IEEE
Trans. Neural Netw. 7, 16–29, 1996.
18. Ball, G. and Hall, D., ISODATA, a novel method of data analysis and pattern classification.
Technical report NTIS AD 699616. Stanford Research Institute, Stanford, CA, 1965.
19. Forgy, E.W., Cluster analysis of multivariate data: Efficiency vs. interpretability of classifica-
tions. Biometrics 21, 768–769, 1965.
20. Eschrich, S., Ke, J., Hall, L.O., and Goldgof, D.B., Fast accurate fuzzy clustering through data
reduction. IEEE Trans. Fuzzy Syst. 11(2), 262–270, 2003.
21. Witten, I.H. and Eibe, F., Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed.
Morgan Kaufman-Elsevier, San Francisco, CA, 2005.
22. Jain, A.K., Murty, M.N., and Flynn, P.J., Data clustering: A review. ACM Comput. Serv. 31(3),
264–323, 1999.
23. Mitra, S. and Acharya, T., Data Mining: Multimedia, Soft Computing, and Bioinformatics. John
Wiley & Sons, Hoboken, NJ, 2003.
24. Dempster, A.P., Laird, M.N., and Rubin, D.B., Maximum likelihood from incomplete data via
the EM algorithm. J. R. Stat. Soc. 39(1), 1–38, 1977.
25. Vellido, A., Missing data imputation through gtm as a mixture of t-distributions. Neural Netw.
19(10), 1624–1635, 2006.
26. Vellido, A., Assessment of an unsupervised feature selection method for generative topo-
graphic mapping, in The 16th International Conference on Artificial Neural Networks (ICANN’2006),
Athens, Greece, LNCS (4132) Springer-Verlag, Berlin, Germany, pp. 361–370, 2006.
27. Drigas, A. and Vrettaros, J., An intelligent tool for building e-learning contend-material using
natural language in digital libraries. WSEAS Trans. Inf. Sci. Appl. 5(1), 1197–1205, 2004.
28. Hammouda, K. and Kamel, M., Data mining in e-learning, in e-Learning Networked Environments
and Architectures: A Knowledge Processing Perspective, Pierre, S. (Ed.). Springer-Verlag, Berlin,
Germany, 2005.
29. Tane, J., Schmitz, C., and Stumme, G., Semantic resource management for the web: An
e-learning application, in The Proceedings 13th World Wide Web Conference, WWW2004, Fieldman,
S., Uretsky, M. (Eds.). ACM Press, New York, pp. 1–10, 2004.
30. Hwang, G.J., A test-sheet-generating algorithm for multiple assessment requirements. IEEE
Trans. Educ. 46(3), 329–337, 2003.
31. Carpenter, G., Grossberg, S., and Rosen, D.B., Fuzzy ART: Fast stable learning and categoriza-
tion of analog patterns by an adaptive resonance system. Neural Netw. 4, 759–771, 1991.
32. Dreyfus, S.E. and Law, A.M., The Art and Theory of Dynamic Programming, Academic Press, New
York, 1977.
33. Kim, J., Chern, G., Feng, D., Shaw, E., and Hovy, E., Mining and assessing discussions on the
web through speech act analysis, in The AAAI Workshop on Web Content Mining with Human
Language Technology, Athens, GA, pp. 1–8, 2006.
34. Chu, W.W. and Chiang, K., Abstraction of high level concepts from numerical values in data-
bases, in The AAAI Workshop on Knowledge Discovery in Databases, Seattle, WA, July 1994.
35. Pirrone, R., Cossentino, M., Pilato, G., and Rizzo, R., Concept maps and course ontology:
A multi-level approach to e-learning, in The II AI*IA Workshop on Artificial Intelligence and
E-learning, Pisa, Italy, September 23–26, 2003.
36. Mendes, M.E.S., Martinez, E., and Sacks, L., Knowledge-based content navigation in e-learning
applications, in The London Communication Symposium, London, U.K., September 9–10, 2002.
37. Zhuhadar, L. and Nasraoui, O., Personalized cluster-based semantically enriched web search
for e-learning, in The ONISW’08, Napa Valley, CA, ACM, New York, pp. 105–111, 2008.
38. Zhuhadar, L. and Nasraoui, O., Semantic information retrieval for personalized e-learning, in
The 20th IEEE International Conference on Tools with Artificial Intelligence, IEEE Computer Society,
Dayton, OH, pp. 364–368, November 3–5, 2008.
39. Chiu, D.Y., Pan, Y.C., and Chang, W.C., Using rough set theory to construct e-learning FAQ
retrieval infrastructure, in The First IEEE International Conference on Ubi-Media Computing,
Lanzhou, China, pp. 547–552, July 15–16, 2008.
40. Su, Z., Song, W., Lin, M., and Li, J., Web text clustering for personalized e-learning based on
maximal frequent itemsets, in The International Conference on Computer Science and Software
Engineering, IEEE Computer Science, Hubei, China, pp. 452–455, December 12–14, 2008.
41. Mullier, D., A tutorial supervisor for automatic assessment in educational systems. Int. J.
e-Learn. 2(1), 37–49, 2003.
42. Mullier, D., Moore, D., and Hobbs, D., A neural-network system for automatically assessing
students, in World Conference on Educational Multimedia, Hypermedia and Telecommunications,
Kommers, P. and Richards, G. (Eds.), AACE, Chesapeake, VA, pp. 1366–1371, 2001.
43. Tang, T.Y. and McCalla, G., Smart recommendation for an evolving e-learning system:
Architecture and experiment. Int. J. e-Learn. 4(1), 105–129, 2005.
44. McCalla, G., The ecological approach to the design of e-learning environments: Purpose-
based capture and use of information about learners. J. Int. Media Educ., Special Issue on the
Educational Semantic Web, 2004.
45. Tai, D.W.S., Wu, H.J., and Li, P.H., A hybrid system: Neural network with data mining in an
e-learning environment, in KES’2007/WIRN’2007, Part II, B. Apolloni et al. (Eds.), LNAI 4693,
Springer-Verlag, Berlin, Germany, pp. 42–49, 2007.
46. Castro, F., Vellido, A., Nebot, A., and Minguillón, J., Finding relevant features to character-
ize student behavior on an e-learning system, in The International Conference on Frontiers in
Education: Computer Science and Computer Engineering (FECS’05), Hamid, R.A. (Ed.), Las Vegas,
NV, pp. 210–216, 2005.
47. Castro, F., Vellido, A., Nebot, A., and Minguillón, J., Detecting atypical student behaviour
on an e-learning system, in VI Congreso Nacional de Informática Educativa, Simposio Nacional de
Tecnologías de la Información y las Comunicaciones en la Educación (SINTICE’2005), Granada, Spain,
pp. 153–160, 2005.
48. Vellido, A., Castro, F., Nebot, A., and Mugica, F., Characterization of atypical virtual campus
usage behavior through robust generative relevance analysis, in The 5th IASTED International
Conference on Web-Based Education (WBE’2006), Puerto Vallarta, Mexico, pp. 183–188, 2006.
49. Teng, C., Lin, C., Cheng, S., and Heh, J., Analyzing user behavior distribution on e-learning
platform with techniques of clustering, in Society for Information Technology and Teacher Education
International Conference, Atlanta, GA, pp. 3052–3058, 2004.
50. Talavera, L. and Gaudioso, E., Mining student data to characterize similar behavior groups in
unstructured collaboration spaces, in Workshop in Artificial Intelligence in Computer Supported
Collaborative Learning, in conjunction with 16th European Conference on Artificial Intelligence
(ECAI’2003), Valencia, Spain, pp. 17–22, 2004.
51. Christodoulopoulos, C.E. and Papanikolaou, K.A., A group formation tool in a e-learning con-
text, in The 19th IEEE International Conference on Tools with Artificial Intelligence, IEEE Computer
Society, Patras, Greece, pp. 117–123, October 29–31, 2007.
52. Tian, F., Wang, S., Zheng, C., and Zheng, Q., Research on e-learner personality grouping
based on fuzzy clustering analysis, in The 12th International Conference on Computer Supported
Cooperative Work in Design (CSCWD’2008), Xi’an, China, pp. 1035–1040, April 16–18, 2008.
92 Handbook of Educational Data Mining
53. Lu, F., Li, X., Liu, Q., Yang, Z., Tan, G., and He, T., Research on personalized e-learning system
using fuzzy set based clustering algorithm, in ICCS’2007, Y. Shi et al. (Eds.), Part III, LNCS
(4489), Springer-Verlag, Berlin, Germany, pp. 587–590, 2007.
54. Zhang, K., Cui, L., Wang, H., and Sui, Q., An improvement of matrix-based clustering method
for grouping learners in e-learning, in The 11th International Conference on Computer Supported
Cooperative Work in Design (CSCWD’2007), Melbourne, Australia, pp. 1010–1015, April 26–28,
2007.
55. Mylonas, P., Tzouveli, P., and Kollias, S., Intelligent content adaptation in the framework of an
integrated e-learning system, in The International Workshop in Combining Intelligent and Adaptive
Hypermedia Methods/Techniques in Web Based Education Systems (CIAH’2005), Salzburg, Austria,
pp. 59–66, September 6–9, 2005.
56. Tang, T.Y. and Chan, K.C., Feature construction for student group forming based on their
browsing behaviors in an e-learning system, in PRICAI’2002, Ishizuka, M. and Sattar, A. (Eds.),
LNAI (2417), Springer-Verlag, Berlin, Germany, pp. 512–521, 2002.
57. Manikandan, C., Sundaram, M.A.S., and Mahesh, B.M., Collaborative e-learning for remote
education-an approach for realizing pervasive learning environments, in 2nd International
Conference on Information and Automation (ICIA’2006), Colombo, Srilanka, pp. 274–278,
December 14–17, 2006.
58. Khalil, F., Li, J., and Wang, H., Integrating recommendation models for improved web
page prediction accuracy, in The 31st Australasian Computer Science Conference (ACSC’2008),
Conference in Research and Practice in Information Technology (CRPIT’2008), Vol. 74, Wollongong,
Australia, pp. 91–100, 2008.
59. Zakrzewska, D., Using clustering techniques for students’ grouping in intelligent e-learning
systems, in USAB 2008, Holzinger, A. (Ed.), LNCS (5298), Springer-Verlag, Berlin, Germany,
pp. 403–410, 2008.
60. Yang, Q., Zheng, S., Huang, J., and Li, J., A design to promote group learning in e-learning by
naive Bayesian, Comput. Intel. Des. 2, 379–382, 2008.
61. Fu, H., and Foghlu, M.O., A conceptual subspace clustering algorithm in e-learning, in The 10th
International Conference on Advances in Communication Technology (ICACT’2008), IEEE Conference
Proceeding, Phoenix Park, Republic of Korea, pp. 1983–1988, 2008.
62. Chan, C., A Framework for assessing usage of web-based e-learning systems, in The Second
International Conference on Innovative Computing, Information and Control (ICICIC’2007),
Kumamoto, Japan, pp. 147–150, September 5–7, 2007.
63. Myszkowski, P.B., Kwaśnicka, H., and Markowska, U.K., Data mining techniques in e-learning
CelGrid system, in The 7th Computer Information Systems and Industrial Management Applications,
IEEE Computer Science, Ostrava, the Czech Republic, pp. 315–319, June 26–28, 2008.
7
Association Rule Mining in Learning
Management Systems
Contents
7.1 Introduction........................................................................................................................... 93
7.2 Background............................................................................................................................ 94
7.3 Drawbacks of Applying Association Rule in e-Learning............................................... 96
7.3.1 Finding the Appropriate Parameter Settings of the Mining Algorithm.......... 97
7.3.2 Discovering Too Many Rules.................................................................................. 97
7.3.3 Discovery of Poorly Understandable Rules.......................................................... 98
7.3.4 Statistical Significance of Discovered Rules......................................................... 99
7.4 An Introduction to Association Rule Mining with Weka in a Moodle LMS............. 100
7.5 Conclusions and Future Trends........................................................................................ 104
Acknowledgments....................................................................................................................... 104
References...................................................................................................................................... 104
7.1 Introduction
Learning management systems (LMSs) can offer a great variety of channels and workspaces
to facilitate information sharing and communication among participants in a course. They
let educators distribute information to students, produce content material, prepare assign-
ments and tests, engage in discussions, manage distance classes, and enable collaborative
learning with forums, chats, file storage areas, news services, etc. Some examples of com-
mercial systems are Blackboard [1], WebCT [2], and Top-Class [3], while some examples of
free systems are Moodle [4], Ilias [5], and Claroline [6]. One of the most commonly used
is Moodle (modular object-oriented dynamic learning environment), a free learning
management system enabling the creation of powerful, flexible, and engaging online
courses and experiences [42].
These e-learning systems accumulate a vast amount of information that is very valuable
for analyzing students' behavior and could create a gold mine of educational data [7]. They
can record all of the students' activities, such as reading, writing, taking tests, performing
various tasks, and even communicating with peers. They normally also provide a database
that stores all the system's information: personal information about the users (profiles),
academic results, and users' interaction data. However, due to the vast quantities of data
these systems can generate daily, it is very difficult to manage data analysis manually.
Instructors and course authors demand tools to assist them in this task, preferably on a
continual basis. Although some platforms offer some reporting tools, it becomes hard for a
tutor to extract useful information when there are a great number of students [8]. The cur-
rent LMSs do not provide specific tools allowing educators to thoroughly track and assess
all learners’ activities while evaluating the structure and contents of the course and its
effectiveness for the learning process [9]. A very promising area for attaining this objective
is the use of data mining. Data mining or knowledge discovery in databases (KDD) is the
automatic extraction of implicit and interesting patterns from large data collections. In addition
to statistics and data visualization, there are many data mining techniques for analyzing the
data. Some of the most useful data mining tasks and methods are clustering, classification,
and association rule mining. These methods uncover new, interesting, and useful knowl-
edge based on users’ usage data. In the last few years, researchers have begun to apply data
mining methods to help instructors and administrators to improve e-learning systems [10].
Association rule mining has been applied to e-learning systems traditionally for asso-
ciation analysis (finding correlations between items in a dataset), including, e.g., the fol-
lowing tasks: building recommender agents for online learning activities or shortcuts [11],
automatically guiding the learner’s activities and intelligently generating and recommend-
ing learning materials [12], identifying attributes characterizing patterns of performance
disparity between various groups of students [13], discovering interesting relationships
from a student’s usage information in order to provide feedback to the course author [14],
finding out the relationships between each pattern of a learner’s behavior [15], finding stu-
dent mistakes often occurring together [16], guiding the search for the best fitting transfer
model of student learning [17], optimizing the content of an e-learning portal by determin-
ing the content of most interest to the user [18], extracting useful patterns to help educators
and web masters evaluating and interpreting online course activities [11], and personal-
izing e-learning based on aggregate usage profiles and a domain ontology [19].
Association rule mining is one of the most well-studied data mining tasks. It discovers
relationships among attributes in databases, producing if-then statements concerning
attribute values [20]. Association rule mining has been applied to web-based education
systems from two points of view: (1) helping professors to obtain detailed feedback on the
e-learning process, e.g., finding out how the students learn on the web, evaluating the
students based on their navigation patterns, classifying the students into groups, and
restructuring the contents of the Web site to personalize the courses; and (2) helping students
in their interaction with the e-learning system, e.g., adapting the course according to the
learner's progress, for instance by recommending personalized learning paths based
on the previous experiences of other, similar students.
This chapter is organized in the following way: First, we describe the background of asso-
ciation rule mining in general and more specifically its application to e-learning. Then, we
describe the main drawbacks and some possible solutions for applying association rule
algorithms in LMSs. Next, we show a practical tutorial of using an association rule mining
algorithm over data generated from a Moodle system. Finally, the conclusions and further
research are outlined.
7.2 Background
IF–THEN rules are one of the most popular ways of knowledge representation, due to their
simplicity and comprehensibility [21]. There are different types of rules according to the
TABLE 7.1
IF-THEN Rule Format
<rule> ::= IF <antecedent> THEN <consequent>
<antecedent> ::= <condition>+
<consequent> ::= <condition>+
<condition> ::= <attribute> <operator> <value>
<attribute> ::= any one of the attributes in the attribute set
<value> ::= any one of the possible values of the corresponding attribute
<operator> ::= = | > | < | ≥ | ≤ | ≠
data mining technique used, for example: classification, association, sequential pattern
analysis, prediction, causality induction, optimization, etc. In the area of KDD, the most
studied ones are association rules, classifiers, and predictors. An example of a generic IF-THEN
rule format in Extended Backus–Naur Form (EBNF) notation is shown in Table 7.1.
Before studying the main association rule mining algorithms in depth, we first formally
define what an association rule is [22].
Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where
each transaction T is a set of items such that T ⊆ I. Each transaction is associated with a
unique identifier, called its TID. We say that a transaction T contains X, a set of some items
in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and
X ∩ Y = Ø. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transac-
tions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction
set D if s% of transactions in D contain X ∪ Y. An example association rule is: "IF diapers
are in a transaction, THEN beer is in the transaction as well in 30% of the cases. Diapers and
beer are bought together in 11% of the rows in the database." In this example, the support
and confidence of the rule are
s = P(X ∪ Y) = 11%
c = P(Y | X) = P(X ∪ Y)/P(X) = s(X ∪ Y)/s(X) = 30%
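For readers who want to verify these numbers on their own data, the following minimal sketch (our own illustration with invented toy transactions, not an example from the chapter) computes the support and confidence of a rule X ⇒ Y directly from the definitions above.

```python
# Minimal sketch of the definitions above: support and confidence of a rule
# X => Y computed over a toy (invented) list of transactions.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """c = s(X U Y) / s(X) for the rule X => Y."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread"},
]
print(support({"diapers", "beer"}, transactions))       # s(X U Y) = 0.50
print(confidence({"diapers"}, {"beer"}, transactions))  # c = 0.666...
```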
FIGURE 7.1
Pseudo-code of Apriori algorithm.
FIGURE 7.2
Apriori candidate generation.
pass are used to generate the candidate itemsets Ck+1, as shown in Figure 7.2. Next, the
database is scanned and the support of the candidates in Ck+1 is counted.
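Since the pseudo-code of Figures 7.1 and 7.2 is not reproduced here, the sketch below (our own hedged illustration in Python, not the book's figures) shows the level-wise step just described: candidate (k+1)-itemsets Ck+1 are generated by joining the frequent k-itemsets Lk, pruned using the Apriori property, and then counted with one scan of the database.

```python
# Hedged sketch of one Apriori level: join L_k with itself, prune candidates
# whose k-subsets are not all frequent, then count support in one database scan.
from itertools import combinations

def apriori_gen(Lk):
    """Join step + prune step: generate candidate (k+1)-itemsets from L_k."""
    k = len(next(iter(Lk)))
    candidates = set()
    for a in Lk:
        for b in Lk:
            union = a | b
            if len(union) == k + 1:
                # prune: every k-subset of the candidate must itself be frequent
                if all(frozenset(s) in Lk for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

def count_support(candidates, transactions):
    """Scan the database once and count the transactions containing each candidate."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

# toy usage with invented transactions
transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
L1 = {frozenset({x}) for x in {"a", "b", "c"}}
C2 = apriori_gen(L1)
print(count_support(C2, transactions))
```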
There are a lot of different association rule algorithms. A comparative study between the
main algorithms that are currently used to discover association rules can be found in [23]:
Apriori [22], FP-Growth [24], MagnumOpus [25], and Closet [26]. Most of these algorithms
require the user to set two thresholds, the minimal support and the minimal confidence,
and find all the rules that exceed the thresholds specified by the user. Therefore, the user
must possess a certain amount of expertise in order to find the right settings for support
and confidence to obtain the best rules.
or facilitate user interpretation by reducing the result set size and by incorporating
domain knowledge [28].
Most of the current data mining tools are too complex for educators to use, and their
features go well beyond the scope of what an educator might require. As a result, the
course administrator is more likely to apply data mining techniques in order to produce
reports for instructors, who then use these reports to make decisions about how to improve
students' learning and the online courses. Nowadays, data mining tools are normally
designed more for power and flexibility than for simplicity. There are also other spe-
cific problems related to the application of association rule mining to e-learning data.
Next, we are going to describe some of the main drawbacks of association rule algo-
rithms in e-learning.
measures has been suggested [32], such as support and confidence, mentioned previously,
as well as purely statistical measures such as the chi-square statistic and the correla-
tion coefficient in order to measure the dependency inference between data variables.
Subjective measures are becoming increasingly important [33], in other words measures
that are based on subjective factors controlled by the user.
Most of the subjective approaches involve user participation in order to express, in accor-
dance with his or her previous knowledge, which rules are of interest. Some suggested
subjective measures [34] are: unexpectedness (rules are interesting if they are unknown to
the user or contradict the user’s knowledge) and actionability (rules are interesting if users
can do something with them to their advantage).
Liu et al. [34] proposed an interestingness analysis system (IAS) that compares the rules
discovered with the user's knowledge about the area of interest. Let U be the set of a
user's specifications representing his or her knowledge space, and A the set of discovered
association rules; the algorithm implements a pruning technique for removing
redundant or insignificant rules by ranking and classifying them into four categories:
conforming rules, unexpected consequent rules, unexpected condition rules, and both-
side unexpected rules. The degrees of membership in each of these four categories are
used for ranking the rules. Using the system's specification language, users indicate their
knowledge about the matter in question through relationships among the fields or items
in the database.
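As a highly simplified illustration of this idea (this is not the IAS algorithm itself, and the attribute-value pairs below are hypothetical), a discovered rule can be assigned to one of the four categories by comparing its antecedent and consequent with a teacher-supplied expectation:

```python
# Simplified illustration (not the IAS algorithm): place each discovered rule
# (antecedent, consequent) into one of the four categories named above by
# comparing it with a user-supplied expectation.
def categorize(rule, expectation):
    """rule and expectation are (antecedent_set, consequent_set) pairs."""
    (ra, rc), (ea, ec) = rule, expectation
    ante_match, cons_match = ra == ea, rc == ec
    if ante_match and cons_match:
        return "conforming"
    if ante_match and not cons_match:
        return "unexpected consequent"
    if cons_match and not ante_match:
        return "unexpected condition"
    return "both-side unexpected"

expected = ({"n_quiz_a=HIGH"}, {"mark=PASS"})     # hypothetical teacher expectation
discovered = ({"n_quiz_a=HIGH"}, {"mark=FAIL"})   # hypothetical mined rule
print(categorize(discovered, expected))           # -> "unexpected consequent"
```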
Finally, another approach is to view the knowledge base as a rule repository on the basis
of which a subjective analysis of the discovered rules can be performed [35]. Before running
the association rule mining algorithm, the teacher could download the relevant knowledge
base, in accordance with his or her profile. The personalization of the rules returned is
based on filtering parameters associated with the type of the course to be analyzed, such
as the area of knowledge, the level of education, the difficulty of the course, etc. The rule
repository is created on the server in a collaborative way, where experts can vote for each
rule in the repository based on educational considerations and the experience they have
gained in other, similar e-learning courses.
teacher. In the context of web-based educational systems, we can identify some attributes
common to a variety of e-learning systems (such as LMSs and adaptive hypermedia
courses), for example visited, time, score, difficulty level, knowledge level, attempts, number
of messages, etc. In this context, the use of standard metadata for e-learning [38] allows the
creation and maintenance of a common knowledge base with a common vocabulary that
can be shared among different communities of instructors or creators of e-learning
courses.
The Sharable Content Object Reference Model (SCORM) [39] standard describes a con-
tent aggregation model and a tracking model for reusable learning objects to support
adaptive instruction based on the learner’s objectives, preferences, performance, and other
factors like instructional techniques. One of the most important features of SCORM is that
it allows the instructional content designer to specify sequencing rules and navigation
behavior while maintaining the possibility of reusing learning resources within multiple
and different aggregation contexts. While SCORM provides a framework for the repre-
sentation and processing of the metadata, it falls short in including the needed support
for more specific pedagogical tracking, such as the use of collaborative resources. We thus
argue that the use of a more specific pedagogical ontology provides a higher level of deci-
sion support for analysis and mining, based on qualitative issues such as the degree of
collaboration in activities.
Lastly, we consider it very important to mention another aspect that can facilitate the
comprehensibility of discovered rules: visualization. The goal of visualization is to
help analysts inspect the data after applying the mining task [40] by means of some
visual representation of the corpus of rules extracted. A range of visual representations is
used, such as tables, two-dimensional matrices, graphs, bar charts, grids, mosaic plots, par-
allel coordinates, etc. Association rule mining can be integrated with visualization tech-
niques [41] in order to allow users to drive the association rule finding process, giving them
control and visual cues to ease understanding of both the process and its results. However,
the visualization methods are still difficult to understand for a non-expert in data mining,
such as a teacher. Therefore, we consider that this question needs to be addressed in the
near future, and the challenge will be to apply these techniques in a more intuitive way,
identifying within these data structures the rules that are relevant and meaningful in the
context of the e-learning analysis, and representing them in a simple way such as icons,
colors, etc., using interactive two-dimensional and three-dimensional representations.
• Collect data. The LMS system is used by students, and the usage and interaction
information is stored in the database. We are going to use the students’ usage data
of the Moodle system.
• Preprocess the data. The data are cleaned and transformed into a mineable for-
mat. In order to preprocess the Moodle data, we used the MySQL System Tray
Monitor and Administrator tools [43] and the Open DB Preprocess task in the
Weka Explorer [29].
• Apply association rule mining. The data mining algorithms are applied to dis-
cover and summarize knowledge of interest to the teacher.
FIGURE 7.3
Mining Moodle data: collect Moodle usage data from the database, preprocess the data, apply data mining algorithms, and interpret/evaluate/deploy the results.
• Interpret, evaluate, and deploy the results. The obtained results or model are
interpreted and used by the teacher for further actions. The teacher can use the
discovered information for making decisions about the students and the Moodle
activities of the course in order to improve the students’ learning.
Our objective is to use the students' usage data from the Moodle system. Moodle stores a lot
of detailed information about content, users, usage, etc. in a relational database. The data
are stored in a single database: MySQL and PostgreSQL are best supported, but Moodle can
also be used with Oracle, Access, Interbase, and any other database supporting Open
DataBase Connectivity (ODBC) connections. We have used MySQL because it is the world's
most popular open source database. Moodle has more than 150 tables; some of the most
important are mdl_log, mdl_assignment, mdl_chat, mdl_forum, mdl_message, and
mdl_quiz.
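As a hedged sketch of how such records might be pulled out for analysis (the connection details are placeholders, and the selected columns are those of the legacy mdl_log table, so names may differ across Moodle versions):

```python
# Hedged sketch: pull raw usage records for one course from Moodle's MySQL
# database before preprocessing. Connection details are placeholders; the
# columns are those of the legacy mdl_log table and may differ by Moodle version.
import mysql.connector
import pandas as pd

conn = mysql.connector.connect(host="localhost", user="moodle_reader",
                               password="secret", database="moodle")
query = "SELECT userid, module, action, time FROM mdl_log WHERE course = 110"
log = pd.read_sql(query, con=conn)   # one row per logged student interaction
print(log.head())
conn.close()
```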
Data preprocessing allows for transforming the original data into a suitable shape to be
used by a particular data mining algorithm or framework. So, before applying a data min-
ing algorithm, a number of general data preprocessing tasks have to be addressed (data
cleaning, user identification, session identification, path completion, transaction identifi-
cation, data transformation and enrichment, data integration, and data reduction). Data
preprocessing of LMS-generated data has some specific issues:
• Moodle and most of the LMSs use user authentication (password protection) in
which logs have entries identified by users since the users have to log-in, and
sessions are already identified since users may also have to log-out. So, we can
remove the typical user and session identification tasks of preprocessing data of
web-based systems.
• Moodle and most of the LMSs record the students’ usage information not only in
log files but also directly in relational databases. The tracking module can store
user interactions at a higher level than simple page access. Databases are more
powerful, flexible, and less bug-prone than typical log text files for data gathering
and integration.
Therefore, the data gathered by an LMS may require less cleaning and preprocessing than
data collected in other systems based on log files. In our case, we are going to do the fol-
lowing preprocessing tasks:
• Selecting data. The first task is to choose the specific Moodle courses we want
to mine.
• Creating summary tables. Starting from the selected course, we create summari-
zation tables that aggregate the information at the required level (e.g., student).
Student and interaction data are spread over several tables. We have created a new
summary table that integrates the most important information for our objective.
This table has one row per student, summarizing all the activities done by that
student in the course and the final mark the student obtained in the same course.
In order to create this table, it is necessary to run several queries against the
database to obtain the information for the desired students (userid value from the
mdl_user_students table) and courses (id value of the mdl_course table). Table 7.2
shows the main attributes used to create the summarization tables; a small sketch
of this aggregation step follows the table.
TABLE 7.2
Attributes Used for Each Student
Name Description
course Identification number of the course
n_assigment Number of assignments completed
n_quiz Number of quizzes taken
n_quiz_a Number of quizzes passed
n_quiz_s Number of quizzes failed
n_messages Number of messages sent to the chat
n_messages_ap Number of messages sent to the teacher
n_posts Number of messages posted to the forum
n_read Number of forum messages read
total_time_assignment Total time spent on assignments
total_time_quiz Total time spent on quizzes
total_time_forum Total time spent in the forum
mark Final mark the student obtained in the course
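The sketch below illustrates this aggregation step, as mentioned above; it is our own example in pandas over a hypothetical per-event export, and the events_df column names are assumptions rather than the real Moodle schema.

```python
# Rough illustration (not the chapter's actual queries) of the summary-table step:
# aggregate a hypothetical per-event export of the Moodle logs into one row per
# student. The events_df column names are assumptions, not the real mdl_log schema.
import pandas as pd

events_df = pd.DataFrame([
    {"userid": 1, "module": "quiz",       "time_spent": 300},
    {"userid": 1, "module": "forum_post", "time_spent": 120},
    {"userid": 1, "module": "assignment", "time_spent": 900},
    {"userid": 2, "module": "quiz",       "time_spent": 200},
])

summary = (events_df.groupby(["userid", "module"])["time_spent"]
           .agg(["count", "sum"])               # number of events and total time
           .unstack(fill_value=0))
summary.columns = [f"{stat}_{mod}" for stat, mod in summary.columns]
# count_quiz plays the role of n_quiz, sum_quiz of total_time_quiz, and so on;
# the final mark would then be joined in from the grades table.
print(summary)
```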
Once the data is prepared, we can apply the association rule mining algorithm. In order
to do so, it is necessary (1) to choose the specific association rule mining algorithm; (2) to
configure the parameters of the algorithm; (3) to identify which table will be used for the
mining; (4) and to specify some other restrictions, such as the maximum number of items
and what specific attributes can be present in the antecedent or consequent of the discov-
ered rules.
In this case, we have used the Apriori algorithm [20] for finding association rules over
the discretized summarization table of the course 110 (Projects), executing this algorithm
with a minimum support of 0.3 and a minimum confidence of 0.9 as parameters.
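As a rough, self-contained stand-in for that Weka run (a toy brute-force rule miner over invented, discretized records, not the chapter's actual data, algorithm implementation, or tool), the same thresholds can be applied as follows:

```python
# Toy run (not the chapter's Weka session) illustrating the same parameter choices:
# mine all rules with support >= 0.3 and confidence >= 0.9 from a small set of
# invented, discretized student records, by brute-force enumeration.
from itertools import chain, combinations

records = [
    frozenset({"n_quiz=HIGH", "n_posts=HIGH", "mark=PASS"}),
    frozenset({"n_quiz=HIGH", "mark=PASS"}),
    frozenset({"n_quiz=LOW", "n_posts=LOW", "mark=FAIL"}),
    frozenset({"n_quiz=HIGH", "n_posts=HIGH", "mark=PASS"}),
]
MIN_SUP, MIN_CONF = 0.3, 0.9

def support(itemset):
    return sum(itemset <= r for r in records) / len(records)

# enumerate frequent itemsets by brute force (fine for a toy example)
items = sorted(set(chain.from_iterable(records)))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= MIN_SUP]

# split each frequent itemset into antecedent => consequent and keep confident rules
for itemset in frequent:
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, k)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= MIN_CONF:
                print(f"{set(antecedent)} => {set(consequent)} "
                      f"(sup={support(itemset):.2f}, conf={conf:.2f})")
```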
Projects is a third-year subject in computer science studies, oriented toward designing,
developing, documenting, and maintaining a full computer science project. Students have to use Moodle
FIGURE 7.4
Results of Apriori algorithm.
Acknowledgments
The authors gratefully acknowledge the financial support provided by the Spanish
Department of Research under TIN2008-06681-C06-03 and P08-TIC-3720 Projects.
References
1. BlackBoard, available at http://www.blackboard.com/ (accessed March 16), 2009.
2. WebCT, available at http://www.webct.com/ (accessed March 16), 2009.
3. TopClass, available at http://www.topclass.nl/ (accessed March 16), 2009.
4. Moodle, available at http://moodle.org/ (accessed March 16), 2009.
5. Ilias, available at http://www.ilias.de/ (accessed March 16), 2009.
6. Claroline, available at http://www.claroline.net/ (accessed March 16), 2009.
7. Mostow, J. and Beck, J., Some useful tactics to modify, map and mine data from intelligent
tutors. Natural Language Engineering 12(2), 195–208, 2006.
8. Dringus, L. and Ellis, T., Using data mining as a strategy for assessing asynchronous discussion
forums. Computers & Education 45, 141–160, 2005.
9. Zorrilla, M.E., Menasalvas, E., Marin, D., Mora, E., and Segovia, J., Web usage mining proj-
ect for improving web-based learning sites. In Web Mining Workshop, Cataluña, Spain, 2005,
pp. 1–22.
10. Romero, C. and Ventura, S., Data Mining in E-Learning, Wit Press, Southampton, U.K., 2006.
11. Zaïane, O., Building a recommender agent for e-learning systems. In Proceedings of the
International Conference in Education (ICCE), Auckland, New Zealand, IEEE Press, New York,
2002, pp. 55–59.
12. Lu, J., Personalized e-learning material recommender system. In International Conference on
Information Technology for Application, Harbin, China, 2004, pp. 374–379.
13. Minaei-Bidgoli, B., Tan, P., and Punch, W., Mining interesting contrast rules for a web-based
educational system. In International Conference on Machine Learning Applications, Louisville, KY,
December 16–18, 2004.
14. Romero, C., Ventura, S., and De Bra, P., Knowledge discovery with genetic programming
for providing feedback to courseware author. User Modeling and User-Adapted Interaction: The
Journal of Personalization Research 14(5), 425–464, 2004.
15. Yu, P., Own, C., and Lin, L., On learning behavior analysis of web based interactive environ-
ment. In Proceedings of the ICCEE, Oslo, Norway, August 6–10, 2001.
16. Merceron, A. and Yacef, K., Mining student data captured from a web-based tutoring tool:
Initial exploration and results. Journal of Interactive Learning Research 15(4), 319–346, 2004.
17. Freyberger, J., Heffernan, N., and Ruiz, C., Using association rules to guide a search for best
fitting transfer models of student learning. In Workshop on Analyzing Student-Tutor Interactions
Logs to Improve Educational Outcomes at ITS Conference, Maceio, Brazil, August 30, 2004.
18. Ramli, A.A., Web usage mining using Apriori algorithm: UUM learning care portal case.
International Conference on Knowledge Management, Malaysia, 2005, pp. 1–19.
19. Markellou, P., Mousourouli, I., Spiros, S., and Tsakalidis, A., Using semantic web mining tech-
nologies for personalized e-learning experiences. In The 4th IASTED Conference on Web-Based
Education, WBE-2005, Grindelwald, Switzerland, February 21–23, 2005.
20. Agrawal, R., Imielinski, T., and Swami, A.N., Mining association rules between sets of items in
large databases. In Proceedings of SIGMOD, Washington, DC, 1993, pp. 207–216.
21. Klosgen, W. and Zytkow, J., Handbook of Data Mining and Knowledge Discovery, Oxford University
Press, New York, 2002.
22. Agrawal, R. and Srikant, R., Fast algorithms for mining association rules. In Proceedings of the
Conference on Very Large Data Bases, Santiago, Chile, September 12–15, 1994.
23. Zheng, Z., Kohavi, R., and Mason, L., Real world performance of association rule algorithms.
In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, San Francisco, CA, ACM, New York, 2001, pp. 401–406.
24. Han, J., Pei, J., and Yin, Y., Mining frequent patterns without candidate generation. In
Proceedings of the ACM-SIGMOD International Conference on Management of Data (SIGMOD’99),
Philadelphia, PA, June, 1999, pp. 359–370.
25. Webb, G.I., OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial
Intelligence Research, 3, 431–465, 1995.
26. Pei, J., Han, J., and Mao, R. CLOSET: An efficient algorithm for mining frequent closed itemsets.
In Proceedings of ACM_SIGMOD International DMKD’00, Dallas, TX, 2000.
27. Ceglar, A. and Roddick, J.F., Association mining. ACM Computing Surveys, 38(2), 1–42, 2006.
28. Calders, T. and Goethals, B., Non-derivable itemset mining. Data Mining and Knowledge
Discovery 14, 171–206, 2007.
29. Weka, available at http://www.cs.waikato.ac.nz/ml/weka/ (accessed March 16), 2009.
30. Scheffer, T., Finding association rules that trade support optimally against confidence. Lecture
Notes in Computer Science 2168, 424–435, 2001.
31. García, E., Romero, C., Ventura, S., and De Castro, C., Using rules discovery for the continuous
improvement of e-learning courses. In Proceedings of the 7th International Conference on Intelligent
Data Engineering and Automated Learning-IDEAL 2006, LNCS 4224, Burgos, Spain, September
20–23, 2006, Springer-Verlag, Berlin, Germany, pp. 887–895.
32. Tan, P. and Kumar, V., Interestingness Measures for Association Patterns: A Perspective, Technical Report
TR00-036. Department of Computer Science, University of Minnesota, Minneapolis, MN, 2000.
33. Silberschatz, A. and Tuzhilin, A., What makes patterns interesting in knowledge discovery
systems. IEEE Transactions on Knowledge and Data Engineering 8(6), 970–974, 1996.
34. Liu, B., Wynne, H., Shu, C., and Yiming, M., Analyzing the subjective interestingness of associa-
tion rules. IEEE Intelligent Systems and their Applications 15(5), 47–55, 2000.
35. García, E., Romero, C., Ventura, S., and De Castro, C., An architecture for making recommen-
dations to courseware authors through association rule mining and collaborative filtering.
UMUAI: User Modelling and User Adapted Interaction 19(1–2), 99–132, 2009.
36. Huysmans, J., Baesens, B., and Vanthienen, J., Using Rule Extraction to Improve the
Comprehensibility of Predictive Models. http://ssrn.com/abstract=961358 (accessed January
20, 2008), 2006.
37. Dougherty, J., Kohavi, R., and Sahami, M., Supervised and unsupervised discretization of con-
tinuous features. International Conference on Machine Learning, Tahoe City, CA, 1995, pp. 194–202.
38. Brase, J. and Nejdl, W., Ontologies and metadata for e-learning. In Handbook on Ontologies,
Springer Verlag, Berlin, Germany, 2003, pp. 579–598.
39. SCORM, Advanced Distributed Learning. Shareable Content Object Reference Model: The SCORM
Overview, ADL Co-Laboratory, Alexandria, VA, available at http://www.adlnet.org (accessed
March 16, 2009), 2005.
40. De Oliveira, M.C.F. and Levkowitz, H., From visual data exploration to visual data mining: A
survey. IEEE Transactions on Visualization and Computer Graphics 9(3), 378–394, 2003.
41. Yamamoto, C.H. and De Oliveira, M.C.F., Visualization to assist users in association rule min-
ing tasks. PhD dissertation, Mathematics and Computer Sciences Institute, Sao
Paulo University, Sao Paulo, Brazil. http://www.sbbd-sbes2005.ufu.br/arquivos/11866.pdf
(accessed March 16, 2009), 2004.
42. Rice, W.H., Moodle e-learning course development. In A Complete Guide to Successful Learning
Using Moodle, Packt Publishing, Birmingham, U.K./Mumbai, India, 2006.
43. Ullman, L., Guía de aprendizaje MySQL, Pearson Prentice Hall, Madrid, Spain, 2003.
44. Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, Morgan
Kaufman, San Francisco, CA, 2005.
8
Sequential Pattern Analysis of Learning
Logs: Methodology and Applications
Contents
8.1 Introduction......................................................................................................................... 107
8.2 Sequential Pattern Analysis in Education....................................................................... 108
8.2.1 Background.............................................................................................................. 108
8.2.2 Tracing the Contextualized Learning Process................................................... 109
8.3 Learning Log Analysis Using Data Mining Approach................................................. 112
8.3.1 Preprocessing: From Events to Learning Actions.............................................. 113
8.3.2 Pattern Discovery.................................................................................................... 114
8.3.3 Pattern Analysis: From Exploratory to Confirmatory Approach.................... 116
8.4 Educational Implications................................................................................................... 118
8.5 Conclusions and Future Research Directions................................................................ 119
Acknowledgments....................................................................................................................... 119
References...................................................................................................................................... 120
8.1 Introduction
Sequential pattern mining is an important challenge in the field of knowledge discovery
and data mining. It is a process of discovering and displaying previously unknown inter-
relationships, clusters, and data patterns with the goal of supporting improved decision-
making [1]. The mined knowledge can be used in a wide range of practical applications,
such as analyzing DNA sequences [2,3], stock price fluctuations [4], telecommunication
network intrusion detection [5], web usage patterns [6], customers’ transaction history [7],
and software structure evaluation [8].
The concept of sequential pattern analysis was first introduced by Agrawal and Srikant
[9], based on their study of customer purchase sequences. Briefly, given a set of sequences,
the problem is to discover subsequences that are frequent, in the sense that the occurrence
of such subsequences among data sequences exceeds a user-specified minimum support.
The support of a sequential pattern is the percentage of sequences that contain the pattern
[10]. In their classic example of book purchase records, a sequential pattern could be mani-
fested by customers who bought Asimov’s novels Foundation, Foundation and Empire, and
Second Foundation in that order. Customers who bought other books in between these three
transactions are still counted as manifesting the sequential pattern. As well, a set of items
can be an element of a sequential pattern, for example, Foundation and Ringworld, followed
protocols. Problems intrinsically arise in interpreting and generalizing such data, because
the content of thoughts that learners report in self-report protocols on learning behavior
only describes what they perceive about themselves in the context of remembered task condi-
tions. Yet, memory is subject to loss (forgetting), distortion (biased sampling of memories),
and reconstruction, which undermines the accuracy and reliability of responses. For these
reasons, direct evidence of learning strategies and motivations has been difficult to collect
and study with traditional methods.
In recent years, there has been increasing interest in the use of data mining to investigate
scientific questions in educational settings, fostering a new and growing research com-
munity for educational data mining [14]. With the emergence of computer-based learning
environments, traces of learners’ activities and behavior that are automatically generated
by computer systems allow researchers to make this latent knowledge explicit. The unob-
trusiveness of this method enables researchers to track learning events in a nonlinear envi-
ronment without disrupting the learner’s thinking or navigation through content. More
importantly, data obtained in real time allow “virtual” re-creation of learners’ actions,
which support grounded interpretations about how learners construct knowledge, and
track their actual choices as well as methods they use to express agency through self-
regulated learning within a particular context.
The extraction of sequential patterns in e-learning contributes to educational research
in various ways. It helps to evaluate learner’s activities and accordingly to adapt and cus-
tomize resource delivery [15], to be compared with expected behavioral patterns and the-
oretically appropriate learning paths [16], to indicate how to best organize educational
interfaces and be able to make suggestions to learners who share similar characteristics
[17], to generate personalized activities to different groups of learners [18], to identify inter-
action sequences indicative of problems and patterns that are markers of success [19], to
recommend to a student the most appropriate links or web pages within an adaptive web-
based course to visit next [20], to improve e-learning systems [21], and to identify predic-
tors of success by analyzing transfer student records [22]. All these applications suggest
that educational data mining has the potential to extend a much wider tool set to the
analysis of important questions in education.
Figure 8.1
A screenshot of nStudy user interface.
that afford multiple and varied options for exercising and expressing agency as they
construct knowledge.
As students select and use tools in nStudy, the system collects data on the fly about their
choices and constructions. Data are logged at the level of a software event and written to
an XML file time-stamped to the millisecond. The XML file is a transcript containing the
events that trace students’ interaction with information—the way it was operated on by
using nStudy’s tools and the time when that event occurred (see Figure 8.2). These events
represent small-grained traces such as moving from one web page to another, clicking a
button, opening a new window, or tagging a selected text.
As a learner uses each tool, nStudy records in detail all the events involved that make
up a composite studying action. For instance, when the student adds a label to information
selected in a sentence from the web page, nStudy logs information about which sentence
is being annotated, which tag is applied, the time when the tag is applied, and how long it
takes to complete the action. The whole series of actions logged during the student’s study-
ing session can be examined to make inferences about the students’ interactive goals, strat-
egies, and beliefs. These records support grounded interpretations about how a learner
constructs knowledge. Trace data reflect what learners do in ways that reveal more accu-
rately, although not perfectly, whether, when, and how learners access prior knowledge
to process new information. With sufficient samples of trace data, it may be possible to
identify standards learners use to make these decisions [24].
Given the complexity of learning logs captured in multimedia learning environments
and the intricacy of the cognitive processes that take place during information processing,
there are several challenges to be met when applying sequential pattern analysis to learn-
ing logs. The first challenge for any software is that given its broad functionality (such as
nStudy), the events recorded in the log file are unlikely to correspond exactly to the learning
tactics being examined by the researcher. If every mouse-click and keystroke is recorded as
a log event, each tactic is likely to generate multiple log events. The simple tactic of labeling
a text segment as “critical” might generate at least a text selection event followed by a menu
selection event, when the learner selects “critical” from a menu. It is even possible that the
sequence of log events generated by a tactic is interrupted by events a researcher deems
extraneous, such as repetitive attempts to select a piece of text due to unskilled mouse
Figure 8.2
A screenshot of nStudy logs: an XML excerpt of millisecond time-stamped view events (e.g., button actions and window focus changes) and model events (e.g., link and note creation) recorded during a study session.
control. Hence, it is necessary to identify the train of log events corresponding to a tactic
and match that train to a stream of logged events infiltrated with noise. The challenge is to
model the sequence at a coarser level—to generate a sequence of higher-level actions from
the raw event logs—such that each action corresponds to a tactic deployed by a learner.
Second, existing sequential pattern mining algorithms, if applied directly in our sce-
nario, may generate excessive instances of irrelevant patterns that add to the difficulties
of producing a meaningful pattern analysis. This relates to a third and greater challenge,
namely, that learning strategies used by students are in general unknown a priori, and,
unlike the sequence of events corresponding to a tactic, are difficult to predict. Given a
large sample of learning logs, it is common for sequential mining to return thousands
of patterns, only a portion of which are educationally meaningful. The challenge is to
use the action stream produced by the previous step in the analysis to detect patterns of
actions corresponding to learning strategies. This is made more difficult by nonstrate-
gic or extraneous actions that act as noise in the action stream. Thus, it is necessary for
educational researchers to either inject their domain knowledge during the sequential
mining phase or apply the domain knowledge post hoc to effectively filter out relevant
patterns that address their research questions. Further, we must find the patterns (strat-
egies) that are repeated with relatively high frequency across a set of log files that might
correspond to, say, the learning sessions generated by all students enrolled in the same
course.
Learning logs like those generated as students study using nStudy offer a wealth of
data about how students process information during learning. Each learning log can
be viewed as a sequence of students’ self-directed interactions, called events, with their
learning environment. Sequential pattern analysis can be used to identify frequently
occurring sequences of events. In the following sections, we propose a complete method-
ology to address these challenges in sequential pattern analysis of learning logs, includ-
ing a flexible schema for modeling learning sequences at different levels. We introduce
two effective mechanisms for injecting research contexts and domain knowledge into
the pattern discovery process, and a constraint-based pattern filtering process using
nStudy log files generated by university students as examples to identify their studying
patterns.
Figure 8.3
Flow of sequential pattern analysis on learning logs: learning logs are parsed against the action library into action files, patterns are mined from the action files, and the resulting patterns are analyzed, moving from exploratory to confirmatory analysis with statistics tools (e.g., SPSS).
1. Preprocessing. In this step, the raw learning logs are taken as the input, consist-
ing of a complex series of low-level events spaced along a time dimension. They
are modeled at the behavioral unit or grain size that the educational researcher
wishes to study. This is done by a log parsing mechanism, which matches events
in a learning log to canonical action patterns predefined by the researcher in the
action library and generates a sequence of temporally ordered learner actions.
2. Pattern Discovery. These sequences are then fed into our sequential mining algo-
rithm to discover patterns across the learning logs.
3. Pattern Analysis–Evaluation. With the abundant sequential patterns generated
in the previous step, educational researchers identify interesting patterns in this
step, test research hypotheses, and perform conventional confirmatory data analy-
sis (compared to the exploratory data analysis in step 2) with the help of other
statistical tools such as SPSS.
Figure 8.4
A snapshot of the action library: XML definitions that map named learner actions (e.g., View_Term and Update_Term in the glossary view) to the canonical sequences of low-level view and model events that realize them.
data would normally halt the shift-reduce parsing process, because traditional compilers
terminate with failure as soon as an error is detected in the input. To avoid this problem,
we incorporate the following mechanisms into our adapted shift-reduce parser to make it
robust enough to continue parsing in the presence of noise:
The output of this parsing analysis is a time-sequenced list of actions that represents the
students’ studying (see Figure 8.5). These action sequences then serve as input in the next
pattern discovery stage.
Figure 8.5
A screenshot of the parsing output—action list.
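A minimal illustration of this preprocessing idea is sketched below; it is not nStudy's actual parser, and the action library entries and event names are invented. It maps a noisy stream of low-level events to higher-level actions by greedily matching canonical event trains from an action library and skipping events that do not advance any pattern.

```python
# Illustrative event-to-action parser (not nStudy's parser): greedily match
# canonical event trains from an (invented) action library against a noisy
# event stream, treating unmatched events as noise.
ACTION_LIBRARY = {                      # canonical event trains, illustrative only
    "Label_Text": ["TextSelection", "MenuSelection"],
    "Make_Note":  ["TextSelection", "NoteCreated"],
}

def parse_events(events):
    actions, i = [], 0
    while i < len(events):
        for name, pattern in ACTION_LIBRARY.items():
            j, matched = i, []
            for step in pattern:
                # tolerate noise: scan forward for the next event of the required type
                while j < len(events) and events[j] != step:
                    j += 1
                if j == len(events):
                    break
                matched.append(j)
                j += 1
            else:
                actions.append(name)    # full train matched -> emit one action
                i = matched[-1] + 1
                break
        else:
            i += 1                      # no action can be matched from here; skip as noise
    return actions

events = ["Scroll", "TextSelection", "Scroll", "MenuSelection",
          "TextSelection", "NoteCreated"]
print(parse_events(events))             # -> ['Label_Text', 'Make_Note']
```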
meaningful and relevant patterns. “Meaningful” patterns are those that are interpretable
in a particular research context. However, only some of the meaningful patterns found
are likely to be relevant to the research questions pursued by the researcher. In the field
of data mining, sequential pattern mining has long been identified as an important topic,
with a focus on developing efficient and scalable sequential pattern mining algorithms.
These algorithms, if applied directly in our scenario, are insufficient, not because of their
algorithm inefficiency, but due to their inability to introduce research contexts and domain
knowledge into the pattern discovery process. We propose two effective mechanisms to
address this problem.
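Before turning to those mechanisms, the following toy sketch (our own stand-in, not GSP, PrefixSpan, or the authors' tool) shows what the raw pattern-discovery step produces: it enumerates short subsequences (gaps allowed) of each action sequence and keeps those whose support across sequences meets a minimum threshold.

```python
# Toy frequent-subsequence counter: enumerate length-2 and length-3 subsequences
# (gaps allowed) in each invented action sequence, count each at most once per
# sequence, and keep those meeting a minimum support.
from itertools import combinations
from collections import Counter

sequences = [
    ["Browse", "Create_Term", "Browse", "Update_Term"],
    ["Browse", "Create_Term", "Update_Term"],
    ["Browse", "Make_Note", "Browse"],
]
MIN_SUPPORT = 2 / 3

counts = Counter()
for seq in sequences:
    subseqs = set()
    for k in (2, 3):
        for idx in combinations(range(len(seq)), k):
            subseqs.add(tuple(seq[i] for i in idx))
    counts.update(subseqs)              # count each pattern once per sequence

frequent = {p: c / len(sequences) for p, c in counts.items()
            if c / len(sequences) >= MIN_SUPPORT}
print(frequent)
```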
Incorporate research context into sequence modeling. Sequence modeling can be
applied in various ways that depend on how the log data are aggregated. For example,
all sessions (log files) from each student in a group could be combined to create a single
sequence for that group, so that the data mining algorithm will find sequential patterns
that are common in that group’s studying activities. Alternatively, the separate sessions
(log files) from a single student could be mined to find sequential patterns that are com-
mon to that student. In short, there are various ways to model input sequences according
to researchers’ needs, and the semantics of sequential patterns is determined by the way
input sequences are modeled. The following are the common ways of modeling sequences
under different research contexts (a small grouping sketch follows the list):
1. Student-based sequence modeling. All actions from a single student are taken
as a sequence of temporally ordered actions. The discovered sequential patterns
across a group of students indicate common learning behaviors within this group.
2. Session-based sequence modeling. Given that a student engages in different ses-
sions, actions from one single session are treated as a sequence. Corresponding
sequential patterns across sessions identify the typical learning behavior of this
student. Alternatively, combined with student-based sequence modeling, session-
based modeling allows us to find common learning behaviors within a particular
session for a group of students.
3. Object-based sequence modeling. The action sequences produced from the pars-
ing process preserve the links between actions and objects. For instance, a “make a
note” action would be distinguished from a “make a glossary” action by the object
type on which an action is performed, that is, note versus glossary. With such
information, we can construct a sequence of actions performed on a particular
type of object. If we constrain input sequences by the object type, for example, note,
glossary term, concept map, or even a segment of text, the sequential patterns offer
information at a micro level. In particular, research questions could be answered
such as whether students review previous notes before making new notes, or how
frequently students revise a glossary after the glossary is first constructed.
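The grouping sketch referred to above contrasts the first two modeling choices; the action-record format is an assumption about the parser output, not nStudy's actual format.

```python
# Group parsed actions into sequences either per student or per (student, session).
# The (student, session, timestamp, action) record format is an assumption.
from collections import defaultdict

actions = [
    ("s1", "sess1", 1, "Browse"), ("s1", "sess1", 2, "Make_Note"),
    ("s1", "sess2", 3, "Browse"), ("s2", "sess1", 1, "Create_Term"),
]

def sequences(records, key):
    groups = defaultdict(list)
    for student, session, ts, action in sorted(records, key=lambda r: r[2]):
        groups[key(student, session)].append(action)
    return dict(groups)

by_student = sequences(actions, key=lambda st, se: st)        # student-based modeling
by_session = sequences(actions, key=lambda st, se: (st, se))  # session-based modeling
print(by_student)
print(by_session)
```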
Enforce educational domain knowledge as constraints. In a sense, all patterns are interest-
ing and meaningful as they reflect students’ real learning processes. However, not every
pattern makes a significant contribution to educational research in understanding how
individuals learn, given current development of educational theories. Constraints are thus
often used in sequential pattern mining to limit the search space to those patterns with
which researchers are likely to produce thought-provoking findings [32]. Two types of
constraints are particularly relevant in educational contexts:
one data set to discover action patterns that correspond to strategies. Sequential pattern
mining discovers patterns that may or may not be useful to the researcher. As we observed
in our previous studies [32–35], the patterns generated from mining algorithms usually
number in the hundreds, or even more. We note the following observations when massive
arrays of patterns are generated:
1. A significant portion of patterns detected offers little insight into students’ stra-
tegic intent. For example, it is very usual for students to scroll up and down to
browse the content when reading a web page. Yet the pattern consisting of a series
of “browse” actions does not, by itself, convey useful information.
2. Researchers could be interested in more than the discovered patterns themselves. A
sequential pattern captures only the sequential properties of a group of students.
Findings about sequences could be
combined with other variables, such as individual characteristics of students, to
discover more about their learning processes. Sometimes, additional analyses
can improve the interpretation of a pattern. For example, by profiling overconfi-
dent and underconfident learners in terms of their sequential learning patterns,
additional contrasts can be framed about the learning styles of these two differ-
ent groups.
To address the first issue, we add an ad hoc query language that allows separating out irrel-
evant or nonmeaningful patterns or identifying specific patterns based on theoretical
assumptions. In the top bar of the parser output, researchers can specify the minimum
frequency, the minimum length of a pattern, and/or particular actions contained in a pattern.
For example, in Figure 8.6, when the minimum frequency is set to be 22, 21211 patterns
were filtered out. If we take the length of pattern into account, the 9th and 23rd will be of
most interest to a researcher. The 9th pattern depicts a process of “browsing the content →
creating a glossary term → updating that term → continue browsing the content”, whereas
the 23rd pattern describes "browsing the content → creating a glossary term → continue
browsing the content → updating that term." These patterns of sequenced actions can then
be aggregated to infer higher-order structures, such as a surface learning approach ver-
sus a deep learning approach, or performance-oriented learning tactics versus mastery-
oriented learning tactics. The patterns in Figure 8.6 represent a deep learning approach
wherein learners create glossary terms to aid understanding and revisit them with further
reading. This method allows building up even larger strategies, based on how students'
behavior expresses motivation.
Figure 8.6
A screenshot of the sequential pattern output.
Figure 8.7
Partial sequence-pattern matrix.
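As an illustration only (the authors' tool exposes this through its query interface; the pattern records and frequencies below are invented), such a post hoc filter might look like:

```python
# Illustrative post hoc filter (not the authors' query language): keep only the
# mined patterns meeting a minimum frequency, a minimum length, and containing
# required actions.
patterns = [
    {"actions": ("Browse", "Browse"), "freq": 140},
    {"actions": ("Browse", "Create_Term", "Update_Term", "Browse"), "freq": 25},
    {"actions": ("Browse", "Create_Term", "Browse", "Update_Term"), "freq": 23},
]

def query(patterns, min_freq=0, min_len=1, must_contain=()):
    return [p for p in patterns
            if p["freq"] >= min_freq
            and len(p["actions"]) >= min_len
            and all(a in p["actions"] for a in must_contain)]

print(query(patterns, min_freq=22, min_len=4, must_contain=["Create_Term"]))
```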
The second issue concerns moving from exploratory to confirmatory data analysis. In
the confirmatory phase, the problem is to show that the discovered action patterns are
evident in subsequent data sets. Once patterns have been identified, they can be counted
and statistically analyzed like any other psychoeducational measure. As shown in Figure
8.7, the sequence-pattern matrix indicates the occurrence of each pattern in each sequence.
In this phase, the software searches for the patterns in the current or subsequent data sets
and counts the frequency of the patterns for each learner (or other unit of observation).
The data from this phase can be exported to statistical software for further analysis. Our
next step is to implement intuitive graphic charts and tables for pattern visualization that
promote better understanding of pattern-based results.
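A hedged sketch of this counting step (with invented sequences and patterns; the real software works over nStudy log files) checks, for each learner, whether each previously discovered pattern occurs as a subsequence with gaps allowed, builds the sequence-pattern matrix, and derives pattern support:

```python
# Sketch of the confirmatory step: build a sequence-pattern matrix recording,
# for each learner's action sequence, whether each discovered pattern occurs
# as a subsequence (gaps allowed), then derive each pattern's support.
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a subsequence with gaps allowed."""
    it = iter(sequence)
    return all(step in it for step in pattern)

sequences = {
    "s1": ["Browse", "Create_Term", "Browse", "Update_Term"],
    "s2": ["Browse", "Browse", "Make_Note"],
}
patterns = [("Browse", "Create_Term", "Update_Term"), ("Browse", "Make_Note")]

matrix = {sid: [int(contains(seq, p)) for p in patterns]
          for sid, seq in sequences.items()}
print(matrix)                       # rows: learners, columns: pattern occurrence (0/1)

support = {p: sum(row[i] for row in matrix.values()) / len(matrix)
           for i, p in enumerate(patterns)}
print(support)                      # fraction of sequences containing each pattern
```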
data with techniques that are sensitive to dynamic manifestations of psychological engage-
ment over the course of engaging complex learning activities.
Many researchers adopt qualitative methods to analyze log data [37,38], whereas oth-
ers generate frequency counts of actions that learners perform [28,39,40]. However, these
approaches to analysis fail to capture either patterns of an individual’s study tactics or
navigation in hypermedia environments [41]. Tracing methodology takes promising
steps toward this goal of capturing the elements of such dynamic learning processes and
advancing educational research. Because it has the potential to capture the dynamics of the
learning process, log analysis addresses concerns that have been expressed about gather-
ing point samples using concurrent think-aloud protocols, free descriptions, or self-report
questionnaires [24,42]. Further, the application of data mining algorithms multiplies the power to interpret these process data and broadens views about learning and other processes that relate to learning. Thus, analyzing traces in log files and identifying the patterns of activities they contain helps researchers to elaborate descriptions of learning processes, generating a more accurate picture of what is “going on” during learning, and offers insights into students’ learning strategies and motivation and how these change within a session or across sessions.
Acknowledgments
Support for this research was provided by grants to Philip H. Winne from the Social
Sciences and Humanities Research Council of Canada (410-2002-1787 and 512-2003-1012),
the Canada Research Chair Program, and Simon Fraser University.
References
1. Benoit, G. Data mining. Annu. Rev. Infor. Sci. Tech. 36: 265–310, 2002.
2. Wang, K., Y. Xu, and J. X. Yu. Scalable sequential pattern mining for biological sequences. In
Proceedings of the Conference on Information Knowledge Management, ed. D. A. Evans, L. Gravano,
O. Herzog, C. Zhai, and M. Ronthaler, pp. 178–187. New York: Assoc. Comput. Machinery, 2004.
3. Zaki, M. Mining data in bioinformatics. In Handbook of Data Mining, ed. N. Ye, pp. 573–596.
Mahwah, NJ: Lawrence Erlbaum Associates, 2003.
4. Zhao, Q. and S. S. Bhowmick. Sequential pattern mining: A survey. Technical Report, CAIS,
Nanyang Technological University, Singapore, No. 2003118, 2003.
5. Hu, Y. and B. Panda. A data mining approach for database intrusion detection. In Proceedings of
the 19th ACM Symposium on Applied Computing, Nicosia, Cyprus, pp. 711–716, 2004.
6. Pei, J., J. Han, B. Mortazavi-Asl, and H. Zhu. Mining access patterns efficiently from web logs.
Lect. Notes. Compt. Sci. 1805: 396–407, 2000.
7. Tsantis, L. and J. Castellani. Enhancing learning environments through solution-based knowl-
edge discovery tools: Forecasting for self-perpetuating systemic reform. J. Special. Educ. Tech.
16: 39–52, 2001.
8. Sartipi, K. and H. Safyallah. Application of execution pattern mining and concept lattice analy-
sis on software structure evaluation. In Proc. of the International Conference on Software Engineering
and Knowledge Engineering, ed. K. Zhang, G. Spanoudakis, and G. Visaggio, pp. 302–308. Skokie:
KSI press, 2006.
9. Agrawal, R. and R. Srikant. Mining sequential patterns. In Proceedings of 11th International
Conference on Data Engineering, ed. P. S. Yu and A. S. P. Chen, pp. 3–14. Washington: IEEE
Comput. Soc. Press, 1995.
10. Srikant R. and R. Agrawal. Mining sequential patterns: Generalizations and performance
improvements. In Proceedings of the 5th International Conference on Extending Database Technology
(EDBT’96), Avignon, France, pp. 3–17, 1996.
11. Han, J., J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings
of the 2000 ACM-SIGMOD International Conference on Management of Data (SIGMOD’00), Dallas,
TX, pp. 1–12, 2000.
12. Pei, J., J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining
sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 2001
International Conference on Data Engineering (ICDE’01), Heidelberg, Germany, pp. 215–224, 2001.
13. Pei, J., J. Han, B. Mortazavi-Asl, and H. Zhu. Mining access patterns efficiently from web logs.
In Proc Pacific-Asia Conf on Knowledge Discovery and Data Mining, ed. T. Terano, H. Liu, and A. L.
P. Chen, pp. 396–407. London, U.K.: Springer-Verlag, 2000.
14. Romero, C. and S. Ventura. Educational data mining: A survey from 1995 to 2005. Expert Syst.
Appl. 33: 135–146, 2007.
15. Zaïane, O. and J. Luo. Web usage mining for a better web-based learning environment. In
Proceedings of Conference on Advanced Technology for Education, Banff, Alberta, pp. 60–64, 2001.
16. Pahl, C. and C. Donnellan. Data mining technology for the evaluation of web-based teaching
and learning systems. In Proceedings of Congress E-learning, Montreal, Canada, pp. 1–7, 2003.
17. Ha, S., S. Bae, and S. Park. Web mining for distance education. In IEEE International Conference
on Management of Innovation and Technology, Singapore, pp. 715–719, 2000.
18. Wang, W., J. Weng, J. Su, and S. Tseng. Learning portfolio analysis and mining in SCORM
compliant environment. In ASEE/IEEE Frontiers in Education Conference, Savannah, Georgia,
pp. 17–24, 2004.
19. Kay, J., N. Maisonneuve, K. Yacef, and O. R. Zaiane. Mining patterns of events in students’
teamwork data. In Proceedings of Educational Data Mining Workshop, Taiwan, pp. 1–8, 2006.
20. Romero, C., S. Ventura, A. Zafra, and P. de Bra. Applying web usage mining for personalizing
hyperlinks in web-based adaptive educational systems. Comput. Educ. 53: 828–840, 2009.
21. Romero, C. and S. Ventura. Data Mining in E-learning. Wit Press, Southampton, U.K., 2006.
22. Luan, J. Data mining and its applications in higher education. New Directions Institut. Res., 113:
17–36, 2002.
23. Winne, P. H. and A. F. Hadwin. nStudy: A Web Application for Researching and Promoting Self-
Regulated Learning (version 1.01) [computer program]. Simon Fraser University, Burnaby,
Canada, 2009.
24. Winne, P. H. How software technologies can improve research on learning and bolster school
reform. Educ. Psychol. 41: 5–17, 2006.
25. Nesbit, J. C. and A. F. Hadwin. Methodological issues in educational psychology. In Handbook of
Educational Psychology, ed, P. A. Alexander and P. H. Winne, pp. 825–847. Mahwah, NJ: Erlbaum,
2006.
26. Hadwin, A. F., P. H. Winne, and J. C. Nesbit. Annual review: Roles for software technologies in
advancing research and theory in educational psychology. Br. J. Educ. Psychol. 75: 1–24, 2005.
27. Hadwin, A. F., J. C. Nesbit, J. Code, D. Jamieson-Noel, and P. H. Winne. Examining trace data
to explore self-regulated learning. Metacogn. Learn. 2: 107–124, 2007.
28. Nesbit, J. C., P. H. Winne, D. Jamieson-Noel et al. Using cognitive tools in gStudy to inves-
tigate how study activities covary with achievement goals. J. Educ. Comput. Res. 35: 339–358,
2007.
29. Winne, P. H., L. Gupta, and J. C. Nesbit. Exploring individual differences in studying strategies
using graph theoretic statistics. Alberta. J. Educ. Res. 40: 177–193, 1994.
30. Baker, R. S. J. d. Data mining for education. In International Encyclopedia of Education (3rd edition),
ed. B. McGaw, P. Peterson, and E. Baker. Oxford, U.K.: Elsevier, in press.
31. Winne, P. H. Meeting challenges to researching learning from instruction by increasing the
complexity of research. In Handling Complexity in Learning Environments: Research and Theory,
ed. J. Elen and R. E. Clark, pp. 221–236. Amsterdam, the Netherlands: Pergamon, 2006.
32. Pei, J., J. Han, and W. Wang. Constraint-based sequential pattern mining: The pattern-growth
methods. J. Intell. Info. Syst. 28: 133–160, 2007.
33. Nesbit, J. C., Y. Xu, P. H. Winne, and M. Zhou. Sequential pattern analysis software for edu-
cational event data. Paper Presented at 6th International Conference on Methods and Techniques in
Behavioral Research, Maastricht, the Netherlands, 2008.
34. Zhou, M. and P. H. Winne. Differences in achievement goal-setting among high-achievers and
low-achievers. Paper Presented at the American Educational Research Association 2008 Annual
Meeting, New York, 2008.
35. Zhou, M. and P. H. Winne. Tracing motivation in multimedia learning contexts. Poster Presented
at 6th International Conference on Methods and Techniques in Behavioral Research, Maastricht, the
Netherlands, 2008.
36. Green, C. D. Of immortal mythological beasts: Operationism in psychology. Theor. Psychol. 2:
291–320, 1992.
37. MacGregor, S. K. Hypermedia navigation profiles: Cognitive characteristics and information
processing strategies. J. Educ. Comput. Res. 20: 189–206, 1999.
38. Schroeder, E. E. and B. L. Grabowski. Patterns of exploration and learning with hypermedia.
J. Educ. Comput. Res. 13: 313–335, 1995.
39. Lawless, K. A. and J. M. Kulikowich. Understanding hypertext navigation through cluster anal-
ysis. J. Educ. Comput. Res. 14: 385–399, 1996.
40. Barab, S. A., M. F. Bowdish, M. Young, and S. V. Owen. Understanding kiosk navigation: Using
log files to capture hypermedia searches. Instr. Sci. 24: 377–395, 1996.
41. Barab, S. A., B. R. Fajen, J. M. Kulikowich, and M. Young. Assessing hypermedia navigation
through pathfinder: Prospects and limitations. J. Educ. Comput. Res. 15: 185–205, 1996.
42. Winne, P. H., D. L. Jamieson-Noel, and K. Muis. Methodological issues and advances in research-
ing tactics, strategies, and self-regulated learning. In Advances in Motivation and Achievement:
New Directions in Measures and Methods, ed. P. R. Pintrich and M. L. Maehr, pp. 121–155.
Greenwich, CT: JAI, 2002.
9
Process Mining from Educational Data
Contents
9.1 Introduction......................................................................................................................... 123
9.2 Process Mining and ProM Framework............................................................................ 125
9.3 Process Mining Educational Data Set.............................................................................. 126
9.3.1 Data Preparation..................................................................................................... 127
9.3.2 Visual Mining with Dotted Chart Analysis....................................................... 128
9.3.3 Conformance Analysis........................................................................................... 131
9.3.3.1 Conformance Checking.......................................................................... 132
9.3.3.2 LTL Analysis............................................................................................. 135
9.3.3.3 Process Discovery with Fuzzy Miner................................................... 136
9.4 Discussion and Further Work........................................................................................... 138
Acknowledgments....................................................................................................................... 141
References...................................................................................................................................... 141
9.1 Introduction
In modern education, various information systems are used to support educational pro-
cesses. In the majority of cases, these systems have logging capabilities to audit and moni-
tor the processes they support. At the level of a university, administrative information
systems collect information about students, their enrollment in particular programs and
courses, and performance data such as examination grades. In addition, information about lectures, instructors, study programs, courses, and prerequisites is typically available as well. These data can be analyzed at various levels and from various perspectives, showing different aspects of the organization and giving us more insight into the overall educational system. At the level of an individual course, we can consider participation in lectures, completion of assignments, and enrollment in midterm and final exams. Moreover, with the development and increasing popularity of blended learning and e-learning, information systems enable us to capture activities at finer levels of granularity as well. Besides more traditional tasks such as predicting overall student performance or dropout [1], it becomes possible to track how different learning resources (video lectures, handouts, wikis, hypermedia, quizzes) are used [2], how students progress with (software) project assignments (svn commits) [3], and how they use self-assessment tests and questionnaires [4].
More recently, traditional data-mining techniques have been extensively applied to find interesting patterns and to build descriptive and predictive models from large volumes of data accumulated through the use of different information systems [5,6]. The results of data mining can be used to gain a better understanding of the underlying educational processes, to generate recommendations and advice for students, to improve resource management, etc. However, most traditional data-mining techniques do not focus on the process as a whole. They do not aim at discovering or analyzing the complete educational process, and it is, for example, not clear how, given a study curriculum, we could automatically check whether students always follow it. Nor is it possible to obtain a clear visual representation of the whole process. To allow for these types of analysis (in which the process plays the central role), a new line of data-mining research, called process mining, has been developed [9].
FIGURE 9.1
Process mining concepts.
Process mining has emerged from the business community. It focuses on the devel-
opment of a set of intelligent tools and techniques aimed at extracting process-related
knowledge from event logs* recorded by an information system. The complete overview
of process mining is illustrated in Figure 9.1. The three major types of process mining
applications are
1. Conformance checking: reflecting on the observed reality, i.e., checking whether the
modeled behavior matches the observed behavior.
2. Process model discovery: constructing complete and compact process models able to
reproduce the observed behavior.
3. Process model extension: projection of information extracted from the logs onto the
model, to make the tacit knowledge explicit and facilitate better understanding of
the process model.
Process mining is supported by the powerful open-source framework ProM [7]. This frame-
work includes a vast number of techniques for process discovery, conformance analysis, and model extension, as well as many other tools such as converters and visualizers. The tool has enabled the wide use of process mining in industry.
* Typical examples of event logs may include resource usage and activity logs in an e-learning environment, an intelligent tutoring system, or an educational adaptive hypermedia system.
In this chapter, we exemplify the applicability of process mining, and the ProM framework in particular, in an educational data-mining context. We discuss some of its potential for extracting knowledge from a particular type of educational information system, considering an (oversimplified) educational process that reflects students' behavior only in terms of their examination traces. We focus on process model conformance and process model discovery, and do not consider model extension.
The structure of the chapter is as follows. In the next section, we explain process mining in more detail and present the ProM framework. Then, we discuss how we applied several process mining techniques to our educational data set, establishing some useful results. Finally, the last section provides a discussion and outlines further work.
FIGURE 9.2
ProM framework. (From van Dongen, B.F. et al., The ProM framework: A new era in process mining tool sup-
port, in Proceedings of the ICATPN 2005, Ciardo, G. and Darondeau, P. (eds.), LNCS 3536, Springer, Heidelberg,
Germany, pp. 444–454, 2005. With permission.)
Conversion plug-ins transform models between different formats, for example, from EPCs to Petri nets. The analysis plug-ins include various analysis methods, checking qualitative and quantitative properties of models, logs, or their combination. Conformance checkers, for example, are implemented as analysis plug-ins requiring both a model and a log. In addition, ProM has a visualization engine to display logs and different types of (discovered) process models, and comes with ProMimport, a separate tool that converts database data into the ProM-recognizable MXML format.
Figure 9.3 shows a snapshot of the ProM 5.0 screen with a few plug-ins running.
It is not feasible to give a complete overview of all ProM plug-ins here (there are around 250 of them at the moment). We, therefore, mention only those that we think are most relevant in the context of educational data mining. These include (1) plug-ins for importing data from a database into an MXML file (ProMimport); (2) plug-ins for log inspection and filtering; (3) conformance analysis plug-ins, in particular those that compare Petri nets with logs and those based on linear temporal logic (LTL) model checking; (4) process model discovery with the Heuristic and Fuzzy Miners [10]; and (5) dotted chart visualization [11]. How we applied these plug-ins to our educational data set is the subject of the next section.
FIGURE 9.3
ProM screenshot with some plug-ins executing.
In this section, we apply several process mining techniques to our educational data set. We first describe some of the important data preparation steps, and then we consider examples of visual data exploration, conformance checking, and process discovery.
FIGURE 9.4
Database schema for preparing data to be converted to MXML format.
be filled with data about the tasks that have been performed during the execution of a process instance (WFMElt contains the name of the task, EventType the task event type, Timestamp the time at which the task changed its state, Originator the person or system that caused the change in the task state, and ATE-ID the primary key); and (4) the Data_Attributes_Audit_Trail_Entries table, which needs to be filled with additional information about each audit trail entry (if such information exists).
Note that it is not required that systems log all of this information; e.g., some systems do
not record transactional information, related data, or timestamps. In the MXML format,
only the ProcessInstance (i.e., case) field and the WorkflowModelElement (i.e., activity) field are
obligatory; all other fields (data, timestamps, resources, etc.) are optional.
Once all the data have been filled into these tables, ProMimport automatically transforms the database into an MXML file.*
In our case, PI-ID corresponds to a student, and WFMElt to the identifier of a course for which an examination took place at the time recorded in Timestamp. Additionally, we can preserve information about grades, intermediate and final graduation, and semester identifiers.
Figure 9.5 illustrates a log example composed of process instances (i.e., students in our
case) with corresponding audit trail entries or events (i.e., examinations in our case) with
various attributes (grade for the exam, attempt, etc.).
* We omit here many details of data cleaning and data selection steps, which are common for a data mining
process.
FIGURE 9.5
Fragment of the MXML log containing data about students’ examinations.
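To make the log structure concrete, the following is a minimal sketch (hypothetical records; the element layout follows the MXML fields named above, but this is not the ProMimport tool itself) that writes examination records into an MXML-style file with Python's standard library.

import xml.etree.ElementTree as ET

records = [  # hypothetical (student, course, grade, timestamp) tuples
    ("s1", "2M104", 80, "2004-09-01T10:00:00"),
    ("s1", "2R000", 55, "2005-01-15T10:00:00"),
    ("s2", "2M104", 90, "2004-09-01T10:00:00"),
]

log = ET.Element("WorkflowLog")
process = ET.SubElement(log, "Process", id="exams")

by_student = {}
for student, course, grade, ts in records:
    by_student.setdefault(student, []).append((course, grade, ts))

for student, events in by_student.items():
    pi = ET.SubElement(process, "ProcessInstance", id=student)  # one instance per student
    for course, grade, ts in sorted(events, key=lambda e: e[2]):
        ate = ET.SubElement(pi, "AuditTrailEntry")
        ET.SubElement(ate, "WorkflowModelElement").text = course   # task = course
        ET.SubElement(ate, "EventType").text = "complete"
        ET.SubElement(ate, "Timestamp").text = ts
        ET.SubElement(ate, "Originator").text = str(grade)         # grade used as originator, as in the text
        data = ET.SubElement(ate, "Data")
        ET.SubElement(data, "Attribute", name="grade").text = str(grade)

ET.ElementTree(log).write("exams.mxml", xml_declaration=True, encoding="utf-8")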
The dotted chart analysis plug-in of ProM is fully configurable (see Figures 9.6 and 9.7).
It is always possible to choose which components are considered, enabling us to quickly
focus on a particular aspect of the event log. Multiple time options are offered to deter-
mine the position of the event in the horizontal time dimension. For example, time can be
set to be actual (the time when the event actually happened is used to position the corre-
sponding dot) or relative (the first event of an instance is time-stamped to zero). Sorting of the vertical axis can be done in multiple ways, e.g., based on instance duration or the number of events. Dot coloring is also flexible.
The dotted chart analysis can show some interesting patterns present in the event log.
For example, if the instance is used as the first component type, the spread of events within
each instance can be identified. The users can easily identify which instance takes longer,
which instances have many events, etc. We now show how this type of analysis is useful
for our data set.
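A view of this kind can also be approximated outside ProM. The following is a minimal sketch (hypothetical data; matplotlib is assumed to be available) that plots one row per student and one dot per exam, with relative time on the horizontal axis and a color band standing in for the grade bands described below.

import matplotlib.pyplot as plt

# Hypothetical data: student -> list of (days since first exam, grade)
exams = {
    "s1": [(0, 45), (120, 62), (200, 85)],
    "s2": [(0, 70), (90, 75), (400, 55)],
}

def band(grade):
    # grade bands as in the text: fail / borderline pass / good
    return "black" if grade < 60 else ("lightgray" if grade < 75 else "gray")

# sort students by the duration of their studies, as in Figure 9.6
order = sorted(exams, key=lambda s: max(t for t, _ in exams[s]))
for row, student in enumerate(order):
    for t, grade in exams[student]:
        plt.scatter(t, row, c=band(grade), edgecolors="none")
plt.yticks(range(len(order)), order)
plt.xlabel("days since first exam (relative time)")
plt.ylabel("student (sorted by study duration)")
plt.title("Dotted-chart-style view of examination events")
plt.show()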
FIGURE 9.6
DCA of the performance of students who started in the same year; instances (one per student) are sorted by the duration of their studies: (a) all the data and (b) the zoomed region.
FIGURE 9.7
DCA of four different “generations” of students who started in different years.
Recall that in our case, the events in the log represent exams, different log instances cor-
respond to different students, tasks are course names, and originators correspond to the
obtained grades.
Figure 9.6 illustrates the output of the dotted chart analysis for the subset of students who started their studies in the same year. All instances (one per student) are sorted by the duration of their studies.
A black point denotes a situation in which a student received a grade for a particular examination that was not enough to pass the course. White denotes grades that are close to the borderline but good enough to pass, and grey denotes really good grades. We can clearly see from Figure 9.6a that the students who studied longer had (on average) lower grades.
The zoomed region (Figure 9.6b) allows us to see vertical black and white “lines” indicating that some exams were correspondingly hard or easy and, importantly, not very discriminating.
Figure 9.7 shows four consecutive generations of students. From this figure we are able, for example, to compare the number of students who started in each year, or to find which generation had more good or bad grades in certain study phases.
The dotted chart, however, does not show which paths students typically take, what choices they make, or whether the order of exams always agreed with the prerequisites specified in the curriculum. For this, we need some more advanced techniques.
The purpose of conformance analysis is to find out whether the information in the log is as specified by the process model, in our case a set of constraints in the study curriculum.
As we discussed in the introduction, this analysis may be used to detect deviations, to
locate and explain these deviations, and to measure the severity of these deviations. ProM
supports various types of conformance analysis; for this chapter, we use the Conformance
Checker and LTL Checker plug-ins.
The Conformance Checker requires, in addition to an event log, some a priori model. This model may be handcrafted or obtained through process discovery. Whatever its source, ProM provides various ways of checking whether reality (the information in the log) conforms to such a model. The LTL Checker, on the other hand, does not compare a model with the log, but checks the log against a set of requirements described in the (linear) temporal logic LTL.
Fitness analysis checks how well the log can be replayed on the model; among other things, it reports the number of events that could not be replayed correctly, which corresponds to the failed tasks in the model view.
Precision analysis deals with the detection of “extra behavior,” discovering, for example, alternative branches that were never used when executing the process. The precision is 100% if the model “precisely” allows for the behavior observed in the log.
Two measures are used to quantify the degree of precision (or behavioral appropriateness): the simple behavioral appropriateness, which is based on the mean number of transitions enabled during log replay (the greater the value, the less behavior is allowed by the process model and the more precisely the behavior observed in the log is captured), and the advanced behavioral appropriateness, which is based on detecting model flexibility (that is, alternative or parallel behavior) that was not used in the real executions observed in the log. Furthermore, a number of options can be used to enhance the visualization of the process model; using the always/never precedes and always/never follows options, we can highlight the corresponding activities.
Besides fitness and precision, the structural properties of the model and its semantics
can be analyzed. In a process model, structure is the syntactic means by which behavior
(i.e., the semantics) can be specified, using the vocabulary of the modeling language (for
example, routing nodes such as AND or exclusive OR). However, there are often several syntactic ways to express the same behavior, and some representations may be easier or harder to understand than others. Clearly, this evaluation dimension depends highly on the process modeling formalism and is difficult to assess in an objective way.
We now show how conformance checking can be used to check whether the students
follow the study curriculum. The idea is to construct a Petri net model of (a small part of) the curriculum and then feed it into the plug-in to look for discrepancies.
A study curriculum typically describes the different possibilities students have and contains different types of constraints that students are expected to obey. Perhaps the most popular constraints are course prerequisites, i.e., we may want to prohibit a student from taking a more advanced course before an introductory course has been passed. Figure 9.8 shows a very small part of the study curriculum in the form of a Petri net (also drawn with a ProM plug-in), representing the constraint that each student has to take at least two courses from 2Y420, 2F725, and 2IH20. In general, any m-out-of-n items constraint can be expressed with a similar approach.
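For readers who want to verify such a constraint directly on the examination traces (ProM instead replays the log against the Petri net), a simplified sketch is given below; the passing threshold and the traces are assumptions for illustration only, and this is not the token-replay conformance algorithm.

PASS_MARK = 60  # assumed passing threshold, for illustration only

def satisfies_m_out_of_n(trace, courses, m, pass_mark=PASS_MARK):
    """trace: list of (course, grade) pairs for one student."""
    passed = {c for c, g in trace if c in courses and g >= pass_mark}
    return len(passed) >= m

traces = {  # hypothetical student traces
    "s1": [("2F715", 70), ("2Y420", 65), ("2IH20", 80)],
    "s2": [("2F725", 55), ("2IH20", 75)],
}
for student, trace in traces.items():
    ok = satisfies_m_out_of_n(trace, {"2Y420", "2F725", "2IH20"}, m=2)
    print(student, "2-out-of-3 satisfied" if ok else "2-out-of-3 violated")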
Figure 9.9 gives the output of the conformance checking (the model view) of the corresponding Petri nets from Figure 9.8. We can see from the figure the places in the model where problems occurred during the log replay, as well as many interesting characteristics, including path coverage, passed edges, failed and remaining tasks, and token counters.
FIGURE 9.8
Petri nets representing a small part of the study curriculum: (a) 2 out of 3 constraint and (b) 2 out of 3 constraint
with prerequisite constraint.
FIGURE 9.9
Conformance analysis of the Petri nets from Figure 9.8—model view: (a) 2 out of 3 constraint and (b) 2 out of 3
constraint with the prerequisite.
In Figure 9.9b, we see that the prerequisite constraint that 2F715 should be taken (and passed) before the 2-out-of-3 choice is made was violated quite often. The course 2IH20 was passed in 12 cases (2F725 once and 2Y420 twice) before the prerequisite 2F715. Figure 9.9a, on the other hand, indicates that there are no problems in the log (the 2 out of 3 constraint was always satisfied).
Figure 9.10b shows the log view of the result where, for the case with the prerequisite constraint, we can see that some log replay problems (marked here as dark rectangles) occurred
FIGURE 9.10
Conformance analysis of the Petri nets from Figure 9.8—log view: (a) two courses out of three constraint and (b)
two courses out of three constraint with prerequisite constraint.
in the log (instances 2, 6, 8, and 12). Figure 9.10a shows no “dark” tasks as the log complies
with the first model.
The LTL Checker plug-in is a log-analysis plug-in that allows us to check whether the log
satisfies a given set of properties. This plug-in, thus, does not need a model to run, but only
requires the user to define the desired properties. These properties, which in our case should
reflect the requirements of the curriculum, are described in (an extension of) the temporal
logic LTL. In the context of a generic log, this powerful and very expressive logic can specify
things like “if task A happens, then task B will happen within 5 time units,” “task A is always eventually performed,” or “person A never performs task B.” As we now show, using this logic we can also verify many interesting properties of our educational data set.
We use the LTL Checker to check (1) whether there are any students who took more than three exams on one day; (2) whether the rule “Logic 1 must be passed before the exam for Logic 2 can be taken” was always respected (prerequisite check); (3) whether there is a course A that is a prerequisite for a course B such that many students got a low grade in A but a high grade in B; (4) whether some students tried to pass an exam they had already passed; and (5) to identify the students who passed some course twice, the second time with a lower/higher grade. All these properties can be easily coded using the LTL language of the plug-in; an excerpt of the code file is given in Figure 9.11. This file can be imported into the user interface of the plug-in, which shows a description of a selected property and allows the user to add the input parameters (in our case, the courses).
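Equivalent checks can also be expressed directly over the examination traces outside the LTL Checker. The following minimal sketch (hypothetical traces; a passing threshold of 60 is assumed, consistent with the grade bounds used in Figure 9.11) implements the first two properties in plain Python; it is a simplified stand-in, not the plug-in itself.

from collections import Counter
from datetime import date

def more_than_three_exams_per_day(trace):
    """trace: list of (course, grade, date) tuples for one student."""
    per_day = Counter(d for _, _, d in trace)
    return any(n > 3 for n in per_day.values())

def c1_always_before_c2(trace, c1, c2, pass_mark=60):
    """Prerequisite check: c2 is never taken before c1 has been passed."""
    events = sorted(trace, key=lambda e: e[2])
    c1_passed = False
    for course, grade, _ in events:
        if course == c2 and not c1_passed:
            return False
        if course == c1 and grade >= pass_mark:
            c1_passed = True
    return True

trace = [("Logic 1", 75, date(2004, 11, 1)), ("Logic 2", 80, date(2005, 3, 1))]
print(more_than_three_exams_per_day(trace), c1_always_before_c2(trace, "Logic 1", "Logic 2"))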
Figure 9.12a shows the result when the first property is checked. As it should not be possible to take more than three exams in one day, a violation of the first property could indicate a flaw in the way the data were entered into the database.
are 38 students (i.e., correct case instances) that satisfy this property, and 11 students that
do not (incorrect case instances). We can select each student now, and find the day when he
or she took these exams; one such situation is highlighted in Figure 9.12a.
Figure 9.12b shows that all students but one have been respecting the curriculum, and
have taken, e.g., Logic 2 only after they have passed Logic 1. Note that we could potentially
incorporate all prerequisites from the curriculum in this formula.
The third property is meant to check whether some prerequisites make sense, and possibly to help in future adaptations of the curriculum. The case in which many students are able to pass course B with a high grade, even though they almost failed course A, may indicate that B does not really need A as a prerequisite. Of course, if B is just a continuation of A, then this situation could instead indicate a significant difference between the requirements for these courses. In either case, normalization issues may need to be taken into account.
The fourth property simply checks whether someone tried to pass an exam that he or she had already passed. Finally, the last property can be used to advise students either to retake or not to retake an exam if they are not satisfied with their current grade (or, in a different setting, to see whether many students are willing to improve their grade).
Note that the LTL Checker plug-in can also be used to filter out some unwanted instances
(i.e., the students) from the log. We have indeed used this feature for the dotted chart
analysis, where we selected the “good” students as log instances that do not satisfy the
LTL formula.
formula exists_student_taking_more_than_three_exams_per_day() :=
{ <h2>Is there a student taking more than three exams per day?</h2> }
exists[ d: day |
    <> ((d == day /\ _O((d == day /\ _O(d == day)))))];

formula course_c1_always_before_c2(c1: course, c2: course) :=
{ <h2>Prerequisite check: Is the course c1 always taken before the course c2?</h2> }
( <> (course == c2)
  ->
  !((course != c1 _U course == c2))
);

formula prerequisite_with_low_grade_but_course_with_high_grade(c1: course, c2: course) :=
{ <h2>Can a low grade for a prerequisite follow a high grade for the course?</h2> }
( course_c1_always_before_c2(c1, c2)
  /\
  <> (((c1 == course /\ grade <= 60)
       /\
       <> ((c2 == course /\ grade >= 80))
      )
     )
);

formula exams_passed_twice() :=
{ <h2>Can an exam be passed twice?</h2> }
exists[ c: course | <> (((c == course /\ grade >= 60)
                         /\ _O(<> ((c == course)))))];

formula exams_passed_twice_second_grade_higher() :=
{ <h2>Exams that are passed twice and the second time the grade was higher?</h2> }
exists[ c: course | ((<> (((c == course /\ grade == 60)
                           /\ _O(<> ((c == course /\ grade > 60)))))
                      \/
                      <> (((c == course /\ grade == 70)
                           /\ _O(<> ((c == course /\ grade > 70))))))
                     \/
                     (<> (((c == course /\ grade == 80)
                           /\ _O(<> ((c == course /\ grade > 80)))))
                      \/
                      <> (((c == course /\ grade == 90)
                           /\ _O(<> ((c == course /\ grade > 90)))))))];
FIGURE 9.11
Conformance analysis of constraints expressed with LTL.
ProM offers several discovery algorithms to achieve this. As our log is expected to be highly unstructured and to contain a lot of rare and uninteresting behavior, we decided to demonstrate the use of the Fuzzy Miner plug-in. This fully parameterizable plug-in can abstract from, or aggregate, less significant behavior, focusing only on the important parts. In this way, we are able to avoid the unreadable, “spaghetti-like” process models that would result from using some other (standard) discovery plug-ins. Of course, if a visual representation of the process is not the ultimate goal (if the model is, e.g., not to be read by humans), other plug-ins would apply here as well.
Mining with Fuzzy Miner is an interactive and explorative process, where the user con-
figures the parameters until the desired level of abstraction is reached. There are several
metrics that can be taken into account for mining, like frequency, routing, distance, or
data, each with a desired strength.
The graph notation used is fairly straightforward. Square nodes represent event classes,
with their significance (maximal value is 1.0) provided below the event class name. Less
FIGURE 9.12
Output of the LTL checker: (a) not more than three exams per day and (b) a course prerequisite check.
significant and lowly correlated behavior is discarded from the process model, i.e., nodes
and arcs that fall into this category are removed from the graph. Coherent groups of less
significant but highly correlated behavior are represented in aggregated form as clusters.
Cluster nodes are represented as octagons, displaying the mean significance of the clustered elements and their number. The internal components of clusters and their structure can be explored by clicking on the corresponding cluster nodes. Links, or arcs, drawn between nodes are decorated with the significance and correlation of the relation they represent. Additionally, arcs are colored in shades of gray; the lower the significance of the relation, the lighter the gray.
We used Fuzzy Miner in a very modest way and included only the metric related to the
total frequency of a task, i.e., in our case to the “popularity” of a course. Figure 9.13 shows
two discovered models of the curriculum with different threshold values that determine
the level of abstraction (by grouping together the courses that were infrequently taken and
thus focusing only on the important parts).
The semantics of Fuzzy Miner models is similar to the semantics of Petri nets. So, the
models from Figure 9.13 allow us to see all the paths the students took and the choices they
made (with corresponding frequencies).
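The following is a much-simplified sketch of the abstraction idea only (it is not the Fuzzy Miner algorithm, which combines several significance and correlation metrics): it builds a directly-follows graph over courses from hypothetical traces and drops courses whose total frequency falls below a chosen threshold, mimicking the "popularity"-based abstraction used above.

from collections import Counter

traces = {  # hypothetical, time-ordered course lists per student
    "s1": ["2M104", "2R000", "2M004", "2IH20"],
    "s2": ["2M104", "2M004", "2IH20"],
    "s3": ["2M104", "2Y345", "2M004"],
}

node_freq = Counter(c for trace in traces.values() for c in trace)
edge_freq = Counter((a, b) for trace in traces.values() for a, b in zip(trace, trace[1:]))

threshold = 2  # abstraction level: keep only courses taken at least this often
kept = {c for c, n in node_freq.items() if n >= threshold}
simplified = {e: n for e, n in edge_freq.items() if e[0] in kept and e[1] in kept}
print(simplified)  # remaining directly-follows edges with their frequencies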
One nice feature of the Fuzzy Miner plug-in is its ability to show the animation of
instances (a snapshot is given in the right upper part of Figure 9.3) flowing through the
process model (either student by student or all at once). The domain experts say that this is
one of the best ways to illustrate the most common paths.
FIGURE 9.13
The outputs of the Fuzzy Miner plug-in with two different levels of abstraction.
Educational data mining has helped to address many issues through the use of traditional classification, clustering, and association analysis techniques. The process perspective in educational domains has also drawn the attention of several researchers; however, most of the traditional intelligent data analysis approaches discussed so far in the context of educational data mining rarely focus on the process as a whole.
In this chapter, we presented a comprehensive introduction to the process mining framework and the ProM tool, and we discussed some of their potential for extracting knowledge from a particular type of educational information system. We considered only a simplified educational process reflecting students' behavior in terms of their examination
traces consisting of a sequence of course, grade, and timestamp triplets for each student.
However, we hope that the reader can see how the same techniques can be applied to many other types of datasets, including learning resource usage logs, various interaction logs (e.g., with an intelligent tutoring system), etc. For example, in [4] we illustrated some of the potential of process mining techniques applied to online assessment data, where students in one of the tests were able to receive tailored immediate elaborative feedback after answering each of the questions one by one in a strict order, and in the other test received no feedback but could answer the questions in a flexible order.
ProM 5.0 provides a pluggable environment for process mining, offering a wide variety of plug-ins for process discovery, conformance checking, model extension, and model transformation. Our further work includes the development of EDM-tailored ProM plug-ins, which, on the one hand, would help to bring process mining tools closer to domain experts (i.e., educational specialists and researchers who do not necessarily have all the technical background), helping them analyze educational processes in a principled way based on formal modeling, and, on the other hand, would allow us to better address some of the EDM-specific challenges related to data preprocessing, namely, dealing with duplicate events, "synonyms," parallelism, pseudo-dependencies, relatively large diversity, and small sample sizes. One particular focus is on integrating domain knowledge into process mining. Some preliminary work in this direction can be found in [13], where we introduced a new domain-driven framework for educational process mining, which assumes that a set of pattern templates can be predefined to focus the mining in a desired way and make it more effective and efficient.
Acknowledgments
This work is supported by NWO. We would like to thank STU for providing the data, and
the many people involved in the development of ProM.
References
1. Dekker, G., Pechenizkiy M., and Vleeshouwers, J., Predicting students drop out: A case study,
in Proceedings of the 2nd International Conference on Educational Data Mining (EDM’09), Cordoba,
Spain, pp. 41–50, July 1–3, 2009.
2. Zafra, A. and Ventura, S., Predicting student grades in learning management systems with
multiple instance genetic programming, in Proceedings of the 2nd International Conference on
Educational Data Mining (EDM’09), Cordoba, Spain, pp. 309–318, July 1–3, 2009.
3. Perera, D., Kay, J., Koprinska, I., Yacef, K., and Zaïane, O.R., Clustering and sequential pattern
mining of online collaborative learning data. IEEE Trans. Knowl. Data Eng. 21(6), 759–772, 2009.
4. Pechenizkiy, M., Trcka, N., Vasilyeva, E., van der Aalst, W., and De Bra, P., Process mining the
student assessment data, in Proceedings of the 2nd International Conference on Educational Data
Mining (EDM’09), Cordoba, Spain, pp. 279–288, July 1–3, 2009.
5. Romero, C., Ventura, S., and García, E., Data mining in course management systems: MOODLE
case study and tutorial. Comput. Educ. 51, 368–384, 2008.
6. Romero, C. and Ventura, S., Educational data mining: A survey from 1995 to 2005. Expert Syst.
Appl. 33(1), 135–146, 2007.
7. van Dongen, B.F., de Medeiros, A.K.A., Verbeek, H.M.W., Weijters, A.J.M.M., and van der Aalst,
W.M.P., The ProM framework: A new era in process mining tool support, in Proceedings of the
ICATPN 2005, Ciardo, G., Darondeau, P., Eds., LNCS 3536, Springer, Heidelberg, Germany,
2005, pp. 444–454.
8. Rozinat, A. and van der Aalst, W.M.P., Conformance checking of processes based on monitor-
ing real behavior. Inform. Syst. 33(1), 64–95, 2008.
9. van der Aalst, W.M.P., Weijters, A.J.M.M., and Maruster, L., Workflow mining: Discovering
process models from event logs. IEEE Trans. Knowl. Data Eng. 16(9), 1128–1142, 2004.
10. Günther, C.W. and van der Aalst, W.M.P., Fuzzy mining: Adaptive process simplification based
on multi-perspective metrics, in Proceedings of the International Conference on Business Process
Management (BPM 2007), Alonso, G., Dadam, P., and Rosemann, M., Eds., LNCS 4714, Springer-
Verlag, Berlin, Germany, 2007, pp. 328–343.
11. Song, M. and van der Aalst, W.M.P., Supporting process mining by showing events at a glance,
in Proceedings of the 7th Annual Workshop on Information Technologies and Systems (WITS’07), Chari,
K. and Kumar, A., Eds., Montreal, Canada, 2007, pp. 139–145.
12. Günther, C.W. and van der Aalst, W.M.P., A generic import framework for process event logs,
in Proceedings of the Business Process Management Workshop, 2006, pp. 81–92 (see also http://
promimport.sourceforge.net/).
13. Trcka, N. and Pechenizkiy, M., From local patterns to global models: Towards domain driven
educational process mining, in Proceedings of the 9th International Conference on Intelligent Systems
Design and Applications (ISDA’09), Pisa, Italy, IEEE CS Press, November 30–December 2, 2009.
10
Modeling Hierarchy and Dependence among
Task Responses in Educational Data Mining
Brian W. Junker
Contents
10.1 Introduction......................................................................................................................... 143
10.2 Dependence between Task Responses............................................................................ 144
10.2.1 Conditional Independence and Marginal Dependence.................................... 144
10.2.2 Nuisance Dependence............................................................................................ 148
10.2.3 Aggregation Dependence...................................................................................... 150
10.3 Hierarchy............................................................................................................................. 150
10.3.1 Hierarchy of Knowledge Structure...................................................................... 151
10.3.2 Hierarchy of Institutional/Social Structure........................................................ 151
10.4 Conclusions.......................................................................................................................... 152
Acknowledgments....................................................................................................................... 153
References...................................................................................................................................... 153
10.1 Introduction
Modern educational systems, especially those mediated by human–computer interaction
systems such as web-based courses, learning content management systems, and adaptive
and intelligent educational systems, make large data streams available by recording stu-
dent actions at microgenetic [25, p. 100] timescales* for days, weeks, and months at a time.
Educational data mining (EDM) [31] aims to exploit these data streams to provide feedback
and guidance to students and teachers and to provide data for educational researchers to
better understand the nature of teaching and learning.
Most empirical problems in EDM can be decomposed into three types: making infer-
ences about the characteristics of tasks (e.g., does the task measure what we want? does it
do so well or poorly? for which students? etc.); making inferences about the characteristics
of students (e.g., what skills, proficiencies, or other knowledge components (KCs) do they
possess? does an intervention improve performance?); and making predictions about stu-
dents’ performance on future tasks (e.g., which students will perform well in a downstream
assessment?). Unfortunately, progress on any of these inferential problems is hampered by
two sources of statistical dependence. First, different student actions may not provide inde-
pendent pieces of information about that individual student. For example, if the student
correctly calculates the area from the lengths of the two sides of the rectangle, we may get little additional independent information from another multiplication task. Second, there may be dependence among multiple responses (from the same or different students), due to common cognitive, social, or institutional contexts. The two kinds of dependence may be combined hierarchically in building models for educational data, and each kind of dependence should be accounted for in making any kind of inference in EDM.
* That is, recording students’ actions at high frequency as they are learning rather than taking a single cross-sectional measurement or infrequent longitudinal measurements.
In this chapter, I illustrate the consequences of failing to account for these sources of
dependence in making inferences about task and student characteristics, using examples
from the literature. In Section 10.2.1, I have expressed the cost in terms of the number of
additional tasks that students would need to perform, in order to achieve the precision
promised by an excessively simple model. In Section 10.3.1, I have highlighted a substan-
tial classification bias that arises from the use of a model that ignores hierarchy. In Section
10.3.2, I consider an example in which ignoring dependence produces a false statistical
significance for the effect of an intervention. I am not aware of correspondingly careful
studies of predictive inference using under-specified models. In situations in which the
predictive criterion is itself less complex (e.g., [3] and [18]), simpler models that ignore
dependence and hierarchy can be quite successful. When models in EDM are expected to
predict fine-grained outcomes, or make precise and accurate predictions over a range of
outcomes, I expect that accounting for dependence and hierarchy will be as important for
prediction as it already is for other EDM inferential tasks.
1. Of the 640 students in a school, 428 were born in Massachusetts. If a newspaper reporter interviews
one student at random, which is the best estimate of the probability that the student was born in
Massachusetts?
2. A bag contains three red, two green, and four blue balls. John is going to draw out a ball without
looking in the bag. What is the probability that he will draw either a green or a blue ball?
3. A bag contains three blue, four red, and two white marbles. Karin is going to draw out a marble
without looking in the bag. What is the probability that she will not draw a red marble?
Figure 10.1
Three questions encountered in the Assistments online tutoring system. (From Worcester Polytechnic
Institute, Assistment: Assisting Students with their Assessments, 2009. Retrieved January 2009 from http://www.
assistment.org/. With permission.)
A student who correctly answers question #2 in Figure 10.1 is much more likely to get question #3
correct, and somewhat more likely to get question #1 correct. This is an example of non-
independence between student responses. It is not certain, however, that the student will
get either problem #3 or problem #1 correct; the dependence between responses is clearly
probabilistic, not deterministic.
Let yij denote the scored response or performance of the ith student (i = 1, …, N) to the jth question or task (j = 1, …, J). Because the outcome yij is probabilistic, we specify a stochastic model for it,
\[ P(y_{ij} = y) = f(y \mid \theta_i, \beta_j, \psi), \qquad (10.1) \]
where
y is a possible score or coding of the student response
θi is a parameter (or parameters) related to the student
βj is a parameter (or parameters) related to the task
ψ is whatever parameters remain (here and throughout, we use ψ to represent incidental
parameters in any part of the model)
When yij is scored dichotomously (e.g., y = 1 for success and y = 0 for failure), f typically has the Bernoulli form
\[ f(y \mid \theta_i, \beta_j, \psi) = g(\theta_i, \beta_j, \psi)^{\,y} \bigl[ 1 - g(\theta_i, \beta_j, \psi) \bigr]^{\,1-y}, \]
where g(θi, βj, ψ) = f(1 | θi, βj, ψ). Examples of models of this type include
• Item response models (IRMs) [35], also called item response theory (IRT) models,
typically take g(θi,βj,ψ) to be a logistic or other sigmoidal function, such as
\[ g(\theta_i, \beta_j, \psi) = \frac{\exp\left[ a_j(\theta_i - b_j) \right]}{1 + \exp\left[ a_j(\theta_i - b_j) \right]}, \qquad (10.2) \]
where
θi is a continuous index of student i’s general facility in the task domain (perhaps inter-
pretable as a crude measure of the aggregation of skills or other knowledge compo-
nents needed to succeed in that domain)
βj = (aj, bj) characterizes the task (aj is related to the power of the task response to classify
high/low facility students; and bj is a measure of task difficulty—it indicates the value
of θ for which a student has a 50% chance of succeeding on the task)
When θi has only one component, the model is called a unidimensional IRM; otherwise
it is a multidimensional IRM. Examples of applications of IRMs to online tutoring data
include Ayers and Junker [3] and Feng et al. [11]. Chang and Ying [8] consider the use of
IRMs to select “the next task” in a computerized adaptive test (CAT).
• Latent class models (LCMs) [14], also known as latent class analysis (LCA) mod-
els, take θi as a discrete “class membership” variable, θi ∈ {1, …, C}, and take βj = (bj1, …, bjC) to be the success probabilities in each class, so that
\[ g(\theta_i, \beta_j, \psi) = b_{j\theta_i}. \qquad (10.3) \]
That is to say, if student i is in class c (θi = c), then P[yij = 1] = bjc. LCMs have been used
extensively for decades; Langeheine and Rost [19] and Rost and Langeheine [30]
provide recent reviews, and van der Maas and Straatemeier [36] give an example of
using LCMs to detect different cognitive strategies (see also [25, Chapter 4]).
• Cognitive diagnosis models (CDMs) [7] posit a set of latent, discrete skills, or more
generally KCs, that underlie task performance. A typical model is the reparam-
eterized unified model (RUM) [7, pp. 1010ff.], in which
\[ g(\theta_i, \beta_j, \psi) = \pi_j \prod_{k=1}^{K} r_{jk}^{(1-\theta_{ik})\, q_{jk}}, \qquad (10.4) \]
where
θi ∈ {0, 1}^K is a vector of K binary indicators of presence/absence of each KC
βj = (πj, rj1, …, rjK) contains the probability πj of success if all KCs are present and the discounts rjk associated with absence of each KC
ψ = Q = [qjk] is an incidence matrix indicating connections between KCs and this yij [4,34]
It is easy to see that this model is also an LCM with C = 2^K latent classes and constraints on the bjc’s determined by the right-hand side of (10.4) [20].
Other extensions and elaborations of these may be found in [33]. For example, models for discrete multi-valued yij ∈ {1, …, R} [41, pp. 115ff.], continuous yij [40], and other forms of yij are also in use.
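To make these response functions concrete, the following is a minimal sketch (with illustrative parameter values that are not taken from the chapter) evaluating the two-parameter logistic item response function of (10.2) and the RUM success probability of (10.4).

import math

def irf_2pl(theta, a, b):
    # P(correct) under the two-parameter logistic IRM of Equation 10.2
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rum_success_prob(theta_vec, pi_j, r_j, q_j):
    # Success probability under the RUM of Equation 10.4; theta_vec and q_j are 0/1 lists
    prob = pi_j
    for theta_k, r_jk, q_jk in zip(theta_vec, r_j, q_j):
        prob *= r_jk ** ((1 - theta_k) * q_jk)  # discount applied only for required but absent KCs
    return prob

print(irf_2pl(theta=0.5, a=1.5, b=0.0))                              # about 0.68
print(rum_success_prob([1, 1, 0], 0.9, [0.5, 0.6, 0.4], [1, 0, 1]))  # 0.9 * 0.4 = 0.36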
The generative model for J tasks is usually obtained by multiplying the per-task models
together,
\[ P(y_{i1}, \ldots, y_{iJ} \mid \theta_i, \beta, \psi) = \prod_{j=1}^{J} f(y_{ij} \mid \theta_i, \beta_j, \psi), \qquad (10.5) \]
where β = (β1, …, βJ). In probabilistic language, we say the responses are conditionally
independent given the parameters θ, β, and ψ. Figure 10.2 shows the typical structure of
(10.5) for a unidimensional IRM; and Figure 10.3 shows a possible structure for a CDM.*
Especially in the latter case, it is clear that these models can be thought of as layers in a
Bayes net with observable (yij) and hidden (θi) nodes and links between them character-
ized by β and ψ. A formal approach linking Bayes net modeling to educational assess-
ment, and leading to many interesting elaborations of these models, is provided by
Mislevy et al. [23].
* Following the usual semantics for graphical models, if deleting a subset S of nodes in the graph produces
disconnected subgraphs, then the subgraphs are conditionally independent given S [10].
Figure 10.2
Typical conditional independence structure for a single latent variable θi and multiple task responses yij (e.g.,
unidimensional item response model). βj ’s control model for the edges of the graph, as in (10.1). Directed edges
point to variables that are conditionally independent, given the variables from which the edges emanate.
Figure 10.3
Typical conditional independence structure for multiple discrete latent KCs θi1,…,θiK (K = 5 in this example) and
multiple task responses yij. βj ’s (not shown, to reduce clutter) control model for the edges of the graph, as in (10.1).
If we regard student i as randomly sampled from a population in which θi has density p(θi), then the generative model for the two-stage “experiment” of randomly sampling a student and observing that student’s task responses is
\[ P_{\beta,\psi}(y_{i1}, \ldots, y_{iJ}) = \int P(y_{i1}, \ldots, y_{iJ} \mid \theta_i, \beta, \psi)\, p(\theta_i)\, d\theta_i. \qquad (10.6) \]
Under (10.6), the responses are no longer independent; for example, the conditional probability of yi1 given yi2 is
\[ P_{\beta,\psi}(y_{i1} \mid y_{i2}) = \frac{P_{\beta,\psi}(y_{i1}, y_{i2})}{P_{\beta,\psi}(y_{i2})} = \frac{\int P(y_{i1}, y_{i2} \mid \theta_i, \beta, \psi)\, p(\theta_i)\, d\theta_i}{\int P(y_{i2} \mid \theta_i, \beta, \psi)\, p(\theta_i)\, d\theta_i} \qquad (10.7) \]
and in most cases this will depend on the value of yi2. Whenever, as would usually be the case, P[yij = 1] is a nondecreasing function of θi, there can be a very strong positive dependence between the yij's for each i, as shown by Holland and Rosenbaum [15].
The dependence here is a mathematical consequence of not “clamping” θi to a fixed value. If we assume θi is fixed, then we would use components from (10.5) instead of (10.6) in (10.7), and we would find that Pβ,ψ(yi1 | yi2) = Pβ,ψ(yi1), i.e., the responses are independent when θi is fixed (and β and ψ are known). This is the psychometric principle of local independence.
If we wish to make inferences about θi, for fixed and known β and ψ, we have a choice
between working with (10.5) or (10.6). Generally speaking, if yi1,…, yiJ are positively cor-
related as in (10.6), they will not provide J independent pieces of information, and infer-
ences about θi will be less certain—higher standard errors of estimation—than if the
yij are independent as in (10.5). On the other hand, the prior distribution p(θi) in (10.6)
can incorporate additional information about the student and drive standard errors of
estimation down again. The decision about whether to use (10.5) or (10.6) may depend to
some extent on the tradeoff between the effects of dependence and prior information on
inferences about θi.
• If students’ task performances are rated by multiple raters of varying quality (e.g.,
human raters), then a variable similar to γic in (10.8) can be incorporated to model
individual differences in severity and internal consistency of multiple raters. As
before, the additional dependence due to averaging over γi(j) increases standard
errors of estimation for θi. Patz et al. [26] develop this approach in detail and com-
pare it to models that ignore this extra dependence.
Patz et al. [26, Table 3] show, in a simulated example, that 95% intervals fail to cover
four out of five true task “difficulty” parameters (βj’s) when dependence due to rater
variability is ignored in the model—indicating severe estimation bias—whereas a
model that accounts for this dependence covers all of the βj’s with mostly narrower
intervals.* In the same example [26, Table 4], they show that 95% intervals for estimat-
ing θi’s are, on average, about half as wide in the model that ignores the dependence
structure† as in the model that accounts for it—indicating unrealistic optimism in
the simpler model about how precisely θi’s have been estimated. In order to obtain
intervals this narrow in the model that correctly accounts for the dependence, we
would need roughly three times as many task responses from each student.
• If multiple tasks are generated from the same task model (sometimes referred to
as “clones” or “morphs” or “item families”; e.g., questions #2 and #3 in Figure 10.1),
then they may share dependence similar to that of (10.8), but expressed through
latent variables γj(c) for the common part of βj for tasks from the same item model c.
Johnson and Sinharay [16] explore this model in detail. In empirical examples
(Figures 10.3 and 10.4), they find that ignoring this source of dependence produces
biased and less certain estimates of task parameters (βj’s) analogous to those in
Patz et al. [26], but has a negligible effect on inferences for θi ’s.
In each situation, some aspect of the way the tasks are generated or scored produces
extra dependence that increases our uncertainty (standard error) in making inferences
about θi. Whether the increase is enough to worry about depends on preliminary analy-
sis in each possible context. Johnson et al. [17] provide a useful review of this class of
models.
* The narrower intervals arise from “borrowing” or “pooling” information across multiple ratings of the same
task, in the dependence model, for estimating that task’s difficulty.
† The intervals are narrower because the simpler model treats each task rating as an independent piece of infor-
mation; whereas different ratings of the same task do not contribute independent information about students’
cognitive status.
Figure 10.4
Hierarchical elaboration of the model in Figure 10.3. Here τi1 and τi2 are higher-order latent variables that model dependence between groups of KC variables θik, and again yij are task responses.
Aggregating the task responses of a sample of N students produces the mixture

(1/N) ∑_{i=1}^{N} ∏_{j=1}^{J} f(yij | θi, βj, ψ),  (10.9)
which exhibits the same sort of nonindependence among response variables as (10.6)
(since, in essence, it replaces integration against the density p(θi) with summation against
the uniform weights 1/N).
This dependence due to aggregation is critical to many data-driven approaches to infer-
ring latent structure. Desmarais’ POKS algorithm, for example as described in [6], finds
“surmise” or prerequisite relationships among tasks by testing for aggregation depen-
dence, which can be traced to latent structure if desired. Pavlik et al. [27] apply a similar
procedure to improve the KC model of an existing tutor.
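A small simulation (with invented mastery probabilities) makes the aggregation effect behind (10.9) concrete: pooling responses from students whose hidden θi values differ induces a positive sample correlation between tasks.

import numpy as np

rng = np.random.default_rng(0)
N = 2000
theta = rng.integers(0, 2, size=N)            # hidden per-student mastery indicator
p1 = np.where(theta == 1, 0.9, 0.3)           # P(correct on task 1 | theta)
p2 = np.where(theta == 1, 0.8, 0.2)           # P(correct on task 2 | theta)
y1, y2 = rng.binomial(1, p1), rng.binomial(1, p2)
print(np.corrcoef(y1, y2)[0, 1])              # clearly positive in the aggregate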
10.3 Hierarchy
While statistical dependence between task responses may be due to a common source of
uncertainty, as in (10.6), or to a common source of variation that we do not try to condition
on in the model, as in (10.8) or (10.9), tasks may also become dependent because of a hierar-
chical relationship between them, or between the KCs that underlie performance on them.
I now turn to these sources of dependence.
the nodes of the network. However, differing data collection methodologies and differing
computational details mean that the models are not usually treated in the same literatures.
One exception is the text by Gelman and Hill [13], which is nearly unique in suggesting
models that include both hierarchical structure for social/institutional dependence as well
as some structures for between-task dependence. Singer and Willett [32] provide another
excellent resource, especially for using hierarchical models to study change over time.
Feng et al. [11] apply multilevel models to choose among KC sets of varying granularity
for online tutoring data that include a time component. In their model, individual differ-
ences in the learning curve are modeled as in (10.6), creating between-task dependence.
The model could easily be expanded to incorporate dependence due to clustering of stu-
dents within different classrooms, schools, etc.
Fox [12] illustrates the use of models that incorporate both between-task dependence as
well as dependence due to clustering of students within schools in three educational stud-
ies: a comparison of mathematics task performance among students in Dutch schools that
do or do not participate in an exit exam program; an examination of the effect of an adap-
tive teaching technique on fourth grade mathematics achievement; and an exploration of
differences within and between schools in the West Bank, in seventh grade mathematics
achievement.
Correctly accounting for dependence in such studies will increase some effect sizes and their statistical significance and decrease others, even in the same study. For example, Fox [12, Table 5] compares a model that
fully accounts for both the hierarchical structure of the data and the dependence between
tasks within student, with a model that ignores this dependence by working with total
scores instead of task scores. (Both models account for dependence due to hierarchical
nesting of students within schools due to institutional structure; ignoring this would pro-
duce many falsely significant comparisons.) The model that fully accounts for dependence
has moderately higher standard errors for most parameters, and in some cases produces
offsetting increases in the parameter estimates themselves. A baseline aptitude measure is
found to have a small but significant effect on student achievement in the simpler model,
whereas in the model that accounts for between-task dependence, it has a larger, still sig-
nificant effect. On the other hand, the adaptive instruction intervention was found to have
a positive, significant effect in the simpler model; however, it was marginally lower and
not significant (due to a greater standard error estimate) in the dependence-aware model.
Thus, accounting for dependence is not always a “win” and not always a “loss” for detect-
ing effects. It will, however, produce more defensible and more generalizable inferences.
10.4 Conclusions
In this chapter, I have reviewed two commonly-seen dependence phenomena for task
response data in EDM: between-task dependence due to common underlying structure
(skills, knowledge components) or common contexts (common topic or task model, com-
mon scoring procedure), and between-student dependence due to common social or insti-
tutional settings. The statistical models used to account for these sources of dependence
are mathematically similar, though often treated in different parts of the literature.
The models I have considered are static in nature. In many EDM settings, dynamic
models that account for change over time, such as latent Markov learning models, are
also appropriate. Although a review of learning models is beyond the scope of this chap-
ter, Rijmen et al. [29] (and the references therein) provide recent, interesting examples of
formally combining response models with latent learning models, and adapting modern
computational methods for complete inferences that appropriately account for sources of
dependence.
Dependence between tasks in aggregated student response data can be exploited to
learn about the latent structure underlying the responses (how many skills or knowledge
components, how are they related to particular tasks, etc.). However, when the latent struc-
ture is taken as known and the goal is to make inferences about individual students’ pro-
ficiencies, or about differences between different groups of students (say, a treatment and
a control group), the effect of dependence between tasks or between students is usually to
increase uncertainty (standard errors of estimation). Analyses that do not account for this
dependence tend to produce answers that are too optimistic about what can be detected;
and that in any event are more difficult to defend and less likely to generalize.
Acknowledgments
Parts of this work were supported by the U.S. Department of Education Grants
#R305K030140 and #R305B04063, and by the National Science Foundation Award
#DMS-0240019.
References
1. Almond, R.G., DiBello, L.V., Mouder, B., and Zapata-Rivera, J.-D. (2007). Modeling diagnostic
assessments with Bayesian networks. Journal of Educational Measurement, 44, 341–359.
2. Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M.I. (2003). An introduction to MCMC for
machine learning. Machine Learning, 50, 5–43. Retrieved June 2008 from http://www.springer-
link.com/content/100309/
3. Ayers, E. and Junker, B.W. (2008). IRT modeling of tutor performance to predict end-of-year
exam scores. Educational and Psychological Measurement, 68, 972–987.
4. Barnes, T. (2005). Q-matrix method: Mining student response data for knowledge. In: Proceedings
of the AAAI-05 Workshop on Educational Data Mining, Pittsburgh, PA (AAAI Technical Report
#WS-05-02).
5. de la Torre, J. and Douglas, J.A. (2004). Higher-order latent trait models for cognitive diagnosis.
Psychometrika, 69, 333–353.
6. Desmarais M.C. and Pu, X. (2005). A Bayesian inference adaptive testing framework and its
comparison with item response theory. International Journal of Artificial Intelligence in Education,
15, 291–323.
7. DiBello, L.V., Roussos, L.A., and Stout, W.F. (2007). Review of cognitively diagnostic assess-
ment and a summary of psychometric models. In: Rao, C.R. and Sinharay, S. (eds.), Handbook of
Statistics, Vol. 26 (Psychometrics). New York: Elsevier, Chapter 31, pp. 979–1030.
8. Chang, H.-H. and Ying, Z. (2009). Nonlinear sequential designs for logistic item response the-
ory models with applications to computerized adaptive tests. Annals of Statistics, 37, 1466–1488.
9. Darroch, J.N., Fienberg, S.E., Glonek, G.F.V., and Junker, B.W. (1993). A three-sample multiple
recapture approach to census population estimation with heterogeneous catchability. Journal of
the American Statistical Association, 88, 1137–1148.
10. Edwards, D. (2000). Introduction to Graphical Modeling, 2nd edn. New York: Springer.
11. Feng, M., Heffernan, N.T., Mani, M., and Heffernan, C. (2006). Using mixed-effects mod-
eling to compare different grain-sized skill models. In: Beck, J., Aimeur, E., and Barnes, T.
(eds.), Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press,
pp. 57–66. Technical Report WS-06-05. ISBN 978-1-57735-287-7. Preprint obtained from http://
www.assistment.org
12. Fox, J.-P. (2004). Applications of multilevel IRT modeling. School Effectiveness and School
Improvement, 15, 261–280. Preprint obtained from http://users.edte.utwente.nl/fox/
13. Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models.
New York: Cambridge University Press.
14. Hagenaars, J. and McCutcheon, A.L. (eds.) (2002). Applied Latent Class Analysis. New York:
Cambridge University Press.
15. Holland, P.W. and Rosenbaum, P.R. (1986). Conditional association and unidimensionality in
monotone latent variable models. Annals of Statistics, 14, 1523–1543.
16. Johnson, M.S. and Sinharay, S. (2005). Calibration of polytomous item families using Bayesian
hierarchical modeling. Applied Psychological Measurement, 29, 369–400.
17. Johnson, M.S., Sinharay, S., and Bradlow, E.T. (2007). Hierarchical item response models. In:
Rao, C.R. and Sinharay, S. (eds.), Handbook of Statistics, Vol. 26 (Psychometrics). New York:
Elsevier, Chapter 17, pp. 587–606.
18. Junker, B.W. (2007). Using on-line tutoring records to predict end-of-year exam scores:
Experience with the ASSISTments project and MCAS 8th grade mathematics. In Lissitz, R.W.
(ed.), Assessing and Modeling Cognitive Development in School: Intellectual Growth and Standard
Settings. Maple Grove, MN: JAM Press.
19. Langeheine, R. and Rost, J. (eds.) (1988). Latent Trait and Latent Class Models. New York/London:
Plenum Press.
20. Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64,
187–212.
21. Minka, T. (2009). Tutorials on Bayesian Inference. Retrieved July 2009 from http://www.research.
microsoft.com/~minka/papers/
22. Minka, T. (2009). Automating variational inference for statistics and data mining. In: Invited
Presentation, 74th Annual Meeting of the Psychometric Society. Cambridge, England, July 2009.
Abstract retrieved September 2009 from http://www.thepsychometricscentre.com/
23. Mislevy, R.J., Steinberg, L.S., and Almond, R.G. (2003). On the structure of educational assess-
ment. Measurement: Interdisciplinary Research and Perspective, 1, 3–62.
24. Murphy, K.P. (2001). The Bayes net toolbox for MATLAB. In Wegman, E.J., Braverman, A.,
Goodman, A., and Smyth, P. (eds.), Computing Science and Statistics 33: Proceedings of the 33rd
Symposium on the Interface, Costa Mesa, CA, June 13–16, 2001, pp. 331–350. Retrieved September
2009 from http://www.galaxy.gmu.edu/interface/I01/master.pdf
25. National Research Council (2001). Knowing What Students Know: The Science and Design of
Educational Assessment. Washington, DC: National Academy Press.
26. Patz, R.J., Junker, B.W., Johnson, M.S., and Mariano, L.T. (2002). The hierarchical rater model
for rated test items and its application to large-scale educational assessment data. Journal of
Educational and Behavioral Statistics, 27, 341–384.
27. Pavlik, P., Cen, H., Wu, L., and Koedinger, K. (2008). Using item-type performance covariance
to improve the skill model of an existing tutor. In: Baker, R.S.J.D., Barnes, T., and Beck, J.E.
(eds.), Educational Data Mining 2008: First International Conference on Educational Data Mining,
Proceedings, Montreal, Canada, June 20–21, 2008, pp. 77–86.
28. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San
Mateo, CA: Morgan Kaufman Publishers.
29. Rijmen, F., Vansteelandt, K., and De Boeck, P. (2008). Latent class models for diary method
data: Parameter estimation by local computations. Psychometrika, 73, 167–182.
30. Rost, J. and Langeheine, R. (eds.) (1997). Applications of Latent Trait and Latent Class Models in the
Social Sciences. New York: Waxmann.
31. Romero, C. and Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert
Systems with Applications, 33, 135–146.
32. Singer, J.D. and Willett, J.B. (2003). Applied Longitudinal Data Analysis: Modeling Change and
Occurrence. New York: Oxford University Press.
33. Rao, C.R. and Sinharay, S. (eds.) (2002). Handbook of Statistics, Vol. 26 (Psychometrics). New
York: Elsevier.
34. Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive error diag-
nosis. In: Frederiksen, N., Glaser, R., Lesgold, A., and Shafto, M.G. (eds.), Diagnostic Monitoring
of Skill and Knowledge Acquisition. Hillsdale, NJ: Erlbaum, pp. 453–488.
35. van der Linden, W.J. and Hambleton, R.K. (eds.) (1997). Handbook of Modern Item Response
Theory. New York: Springer.
36. van der Maas, H.L.J. and Straatemeier, M. (2008). How to detect cognitive strategies:
Commentary on ‘Differentiation and integration: Guiding principles for analyzing cognitive
change’. Developmental Science, 11, 449–453.
37. Wang, X., Bradlow, E.T., and Wainer, H. (2002). A general Bayesian model for testlets: Theory
and applications. Applied Psychological Measurement, 26, 109–128.
38. Weaver, R. (2008). Parameters, predictions, and evidence in computational modeling: A statisti-
cal view informed by ACT-R. Cognitive Science, 32, 1349–1375.
39. Worcester Polytechnic Institute. (2009). Assistment: Assisting Students with their Assessments.
Retrieved January 2009 from http://www.assistment.org/
40. Yanai, H. and Ichikawa, M. (2007). Factor analysis. In: Rao, C.R. and Sinharay, S. (eds.), Handbook
of Statistics, Vol. 26 (Psychometrics). New York: Elsevier, Chapter 9, pp. 257–296.
41. Yen, W.M. and Fitzpatrick, A.R. (2006). Item response theory. In: Brennan, R.L. (ed.), Educational
Measurement, 4th edn. Westport, CT: American Council on Education & Praeger Publishers,
Chapter 4, pp. 111–154.
Part II
Case Studies
11
Novel Derivation and Application of Skill
Matrices: The q-Matrix Method
Tiffany Barnes
Contents
11.1 Introduction......................................................................................................................... 159
11.2 Relation to Prior Work........................................................................................................ 160
11.3 Method................................................................................................................................. 162
11.3.1 q-Matrix Algorithm................................................................................................ 162
11.3.2 Computing q-Matrix Error.................................................................................... 163
11.3.3 Hypotheses and Experiment................................................................................. 165
11.4 Comparing Expert and Extracted q-Matrices................................................................. 165
11.4.1 Binary Relations Tutorial, Section 1 (BRT-1)........................................................ 165
11.4.2 Binary Relations Tutorial, Section 2 (BRT-2)........................................................ 166
11.4.3 Binary Relations Tutorial, Section 3 (BRT-3)....................................................... 166
11.4.4 How Many Concepts and How Much Data........................................................ 168
11.4.5 Summary of Expert-Extracted Comparison....................................................... 169
11.5 Evaluating Remediation..................................................................................................... 170
11.6 Conclusions.......................................................................................................................... 170
References...................................................................................................................................... 171
11.1 Introduction
For a tutor or computer-aided instructional tool, a skill matrix represents the relationship
between tutor problems and the underlying skills students need to solve those problems.
Skill matrices are used in knowledge assessment to help determine probabilities that skills
are learned and to select problems in mastery learning environments. Typically, skill matri-
ces are determined by subject experts, who create a list of target skills to be taught and asso-
ciate each problem with its related skills in the skill matrix. Experts can disagree widely
about the skills needed to solve certain problems, and using two very different skill matri-
ces in the same tutor could result in very different knowledge assessments and student
learning experiences. We argue that skill matrices can be empirically derived from student
data in ways that best predict student knowledge states and problem-solving success, while
still assuming that students may have made slips or guesses while working on problems.
In this chapter, we describe the q-matrix method, an educational data-mining technique
to extract skill matrices from student problem-solving data and use these derived skill
matrices, or q-matrices, in novel ways to automatically assess, understand, and correct
student knowledge. Through exploiting the underlying assumption that students usually
“have a skill or not,” the q-matrix method effectively groups students according to similar
performance on tutor problems and presents a concise and human-understandable sum-
mary of these groups in matrix form. Using this data-derived skill matrix, new students
can be automatically classified into one of the derived groups, which is associated with a
skill profile, and this profile can be used with the skill matrix to automatically determine
which problems a student should work on to increase his or her skill levels. This technique
can be used in a mastery learning environment to iteratively choose problems until a stu-
dent has attained the threshold mastery level on every skill.
We present a suite of case studies comparing the q-matrix method to other automated
and expert methods for clustering students and evaluating the q-matrix method for direct-
ing student studies. The chapter concludes with a discussion of the ways the q-matrix
method can be used to augment existing computer-aided instructional tools with adaptive
feedback and individualized instruction.
TABLE 11.1
Example q-Matrix
q1 q2 q3 q4 q5 q6 q7
con1 1 0 0 0 0 1 1
con2 1 1 0 1 0 0 1
con3 1 1 1 0 0 0 0
graph and compares these points with student work [10,17]. This plotting allows for direct
comparison of hypothesized rules and actual errors, without having to catalog every pos-
sible mistake. For example, for the sum −1 + −7, a mistaken rule to add absolute values
yields an answer of 8 and the correct rule gives −8. Each such answer can be compared to
student results, and the rule with the closest predicted answer is assumed to correspond
to the one the student is using.
This idea of determining a student’s knowledge state from question responses inspired
the creation of a q-matrix, a binary matrix showing the relationship between test items and
latent attributes, or concepts. Students were assigned knowledge states based on their test
answers and the manually constructed q-matrix. Since this method was used primarily for
diagnosis on tests, no learning was assumed to occur between problems.
A q-matrix is a matrix that represents relationships between a set of observed variables
(e.g., questions) and latent variables that relate these observations. We call these latent vari-
ables “concepts.” A q-matrix, also known as a skill matrix or “attribute-by-item incidence
matrix,” contains a one if a question is related to the concept, and a zero if not. An example
q-matrix is given in Table 11.1. In this q-matrix, the observed variables are seven questions
q1–q7, and the concepts are c1–c3. For simplicity, this example is given with binary values.
Brewer extended q-matrix values to range from zero to one, representing the probability
that a student will answer a question incorrectly if he or she does not understand the con-
cept [11]. For a given q-matrix Q, the value of Q(Con,Ques) represents the conditional proba-
bility that a student will miss the question Ques given that they do not understand concept
Con. In Table 11.1, this means that question q1 will not be answered correctly unless a
student understands all three concepts c1–c3. Question q2 will not be answered correctly
unless a student understands concepts c2 and c3, even if the student does not understand
concept c1. In contrast, question q5 can be answered correctly without understanding any
of the concepts c1–c3. This indicates that prerequisite knowledge not represented in the
q-matrix affects the answer to q5. In a more general setting, q-matrix “concepts” can be
thought of similarly to the components extracted in principal components analysis, since
they represent abstract data vectors that can be used to understand a larger set. These con-
cepts can also be used to describe different clusters of observed data.
Birenbaum et al.’s rule space research showed that it is possible to automate the diag-
nosis of student knowledge states, based solely on student item-response patterns and
the relationship between questions and their concepts [10,17]. Though promising, the rule
space method is time consuming and topic-specific, and requires expert analysis of ques-
tions. The rule space method provides no way to show that the relationships derived by
experts are those used by students, or that different experts will create the same rules.
To better understand which q-matrix might explain student behavior, Brewer created a
method to extract a q-matrix from student data and found that the method could be used
to recover knowledge states of simulated students [11]. Sellers designed the binary relations
tutorial (BRT) and conducted an experiment to determine if the q-matrices extracted from
student responses corresponded well with expert-created q-matrices for each of the three
sections of the BRT [16]. With a small sample of 17 students, she found that Sections 1 and
2 corresponded well with expert analysis but that Section 3 did not [16]. Barnes and Bitzer
applied the method to larger groups of students [6] and found the method comparable to
standard knowledge discovery techniques for grouping student data [4,7]. In particular,
the q-matrix method outperformed factor analysis in modeling student data and resulted
in much more understandable concepts (when compared to factors), but had higher error
than k-means cluster analysis on the data. However, to use cluster analysis for automated
remediation, experts would have to analyze the clusters to determine what misconcep-
tions each cluster represented, and then determine what material to present to students in
the cluster. With the q-matrix method, each cluster directly corresponds to its own repre-
sentative answer and concept state that can be used for remediation, as described below.
Barnes later applied the q-matrix method in a novel way to understand strategies for
solving propositional proofs [3]. Barnes et al. also compared the method to facets for giv-
ing teachers feedback on student misconceptions [8]. In this chapter, we detail the q-matrix
method and its use in a computer-aided instructional tutorial [5].
11.3 Method
11.3.1 q-Matrix Algorithm
The q-matrix algorithm is a simple hill-climbing algorithm that creates a matrix represent-
ing relationships between concepts and questions directly from student response data [11].
The algorithm varies c, the number of concepts, and the values in the q-matrix, minimizing
the total error for all students for a given set of n questions. To avoid local minima, each
search is seeded with different random q-matrices and the best search result is kept.
In the q-matrix algorithm, we first set c, the number of concepts, to one, and then gener-
ate a random q-matrix of concepts versus questions, with values ranging from zero to one.
We then calculate the concept states, which are binary strings representing the presence
(1) or absence (0) of each concept. For each concept state, we calculate a corresponding pre-
dicted answer or ideal response vector (IDR). We compare each student response to each
IDR and assign the response to the closest IDR and parent concept state, with an “error”
being the distance from the response to the IDR. The total q-matrix error is the sum of
these errors associated with assigning students to concept states, over all students.
We then perform hill-climbing by adding or subtracting a small fixed delta to a single
q-matrix value, and recomputing its error. If the overall q-matrix error is improved, the
change is saved. This process is repeated for all the values in the q-matrix several times
until the error in the q-matrix is not changing significantly. After a q-matrix is computed in
this fashion, the algorithm is run again with a new random initial q-matrix several times,
and the q-matrix with minimum error is saved, to avoid local minima. The final result is
not guaranteed to be the absolute minimum, but provides an acceptable q-matrix for c, the
current given number of concepts.
To determine the best number of concepts to use in the q-matrix, this algorithm is
repeated for increasing values of c, until a stopping criterion is met. There are two stopping
criteria to consider: either stopping when the q-matrix error falls below a pre-set threshold, such as less than 1 per student as used here, or looking for a decrease in the
marginal reduction of error by adding more concepts. In all cases, the number of concepts
should be as small as possible to avoid over-fitting to particular student responses, since
TABLE 11.2
q-Matrix Method Pseudo-Code for Fixed NumCon
Set MinError = LargeNumber;
For Starts = 1 to NumStarts
    Randomly initialize Q[NumCon][NumQues];
    Set Q* = Q; Set CurrError = Error(Q);
    For Iter = 1 to NumIter
        For c = 1 to NumCon
            For q = 1 to NumQues
                Q*[c][q] = Q[c][q] + Delta;
                If (Error(Q*) < CurrError)
                    Do
                        Set Q = Q*; Set CurrError = Error(Q*);
                        Q*[c][q] = Q[c][q] + Delta;
                    While (Error(Q*) < CurrError);
                Else
                    Q*[c][q] = Q[c][q] − Delta;
                    While (Error(Q*) < CurrError)
                        Set Q = Q*; Set CurrError = Error(Q*);
                        Q*[c][q] = Q[c][q] − Delta;
    If (CurrError < MinError)
        Set BestQ = Q; Set MinError = CurrError;
we do not assume that student responses are free from guesses and slips. Pseudocode for the hill-climbing algorithm with a fixed number of concepts, NumCon, is given in Table 11.2; this search is repeated for increasing NumCon until we meet the pre-set stopping criterion.
The parameter NumStarts allows enough random starts to avoid local minima. In the inte-
rior loop, single values in the q-matrix are optimized (by Delta at a time) while the rest of the
q-matrix is held constant. Therefore, several (NumIter) passes through the entire q-matrix
are made. In these experiments, NumStarts was 50, Delta was 0.1, and NumIter was 5.
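The following condensed Python sketch mirrors the hill climbing of Table 11.2; it is an illustration rather than the original implementation, and it assumes the error measure described in the text (the summed L1 distance from each response to its nearest ideal response vector, with IDRs computed as in (11.1)). All function and variable names are ours.

import numpy as np
from itertools import product

def idrs(Q):
    """Ideal response vectors for all 2^NumCon concept states (one row per state)."""
    states = np.array(list(product([0, 1], repeat=Q.shape[0])))
    # (11.1): known concepts contribute a factor of 1, unknown ones 1 - Q[c, q].
    return np.array([np.prod(np.where(s[:, None] == 1, 1.0, 1.0 - Q), axis=0)
                     for s in states])

def q_error(Q, responses):
    """Total L1 distance from each response to its nearest IDR."""
    dists = np.abs(responses[:, None, :] - idrs(Q)[None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

def fit_q_matrix(responses, num_con, num_starts=50, num_iter=5, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    best_Q, min_error = None, np.inf
    for _ in range(num_starts):                    # random restarts to avoid local minima
        Q = rng.random((num_con, responses.shape[1]))
        curr_error = q_error(Q, responses)
        for _ in range(num_iter):                  # several passes over all q-matrix entries
            for c in range(num_con):
                for q in range(responses.shape[1]):
                    for step in (delta, -delta):   # try moving the entry up, then down
                        while True:
                            trial = Q.copy()
                            trial[c, q] = np.clip(Q[c, q] + step, 0.0, 1.0)
                            trial_error = q_error(trial, responses)
                            if trial_error < curr_error and trial[c, q] != Q[c, q]:
                                Q, curr_error = trial, trial_error
                            else:
                                break
        if curr_error < min_error:
            best_Q, min_error = Q, curr_error
    return best_Q, min_error

Here responses is an N × NumQues array of item scores; concept states are enumerated exhaustively, which is practical only for a small number of concepts.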
11.3.2 Computing q-Matrix Error
For a binary q-matrix, the ideal response vector for a concept state s can be computed with Boolean matrix algebra as

IDR = ¬((¬s)Q)

For example, given concept state s = 0110 and a four-concept, six-question binary q-matrix Q for which ¬s = 1001 and (¬s)Q = 101001, we obtain IDR = 010110. (¬s)Q can be viewed as a vector of all the questions
that require concepts that are unknown for a student in concept state s. Thus, the IDR for s
is exactly the remaining questions, since none of them require concepts that are unknown
for a student in concept state s.
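The Boolean computation can be checked against Tables 11.1 and 11.3 with a few lines of Python (an added illustration, not code from the chapter):

import numpy as np

Q = np.array([[1, 0, 0, 0, 0, 1, 1],     # c1 (Table 11.1)
              [1, 1, 0, 1, 0, 0, 1],     # c2
              [1, 1, 1, 0, 0, 0, 0]])    # c3

def idr(s, Q):
    unknown = 1 - np.asarray(s)          # NOT s
    blocked = (unknown @ Q) > 0          # (NOT s) Q as a Boolean matrix product
    return (~blocked).astype(int)        # NOT(...): 1 iff no required concept is unknown

for state in [(0, 0, 0), (0, 1, 1), (1, 1, 1)]:
    print(state, ''.join(map(str, idr(state, Q))))
# (0, 0, 0) 0000100, (0, 1, 1) 0111100, (1, 1, 1) 1111111 -- matching Table 11.3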
When the q-matrix consists of continuous probabilities, for a given question q and each unknown concept c, the conditional probability that the student will answer question q correctly given that concept c is unknown is 1 − Q(c, q), where Q(c, q) is the q-matrix value for concept c and question q. Then, since we assume concept–question relationships to be independent, the probability that question q will be answered correctly is the product of 1 − Q(c, q) over all unknown concepts c (and the prediction that q will be missed is 1 minus this product). Mathematically, given a q-matrix Q with NumCon concepts, and a concept state S, where S(c) denotes whether concept c is understood, we calculate the ideal response (IDR) for each question q:
IDR(q) = ∏_{c=1}^{NumCon} { 1 if S(c) = 1; 1 − Q(c, q) if S(c) = 0 }   (11.1)
An IDR can be constructed then by using these probabilities as predicted answers (which
will always be incorrect since we assume no partial credit is given), or they can be rounded
to 1 or 0 according to a threshold. In our work, we use the probabilities as predictions, since
we wish to accumulate a more accurate reflection of our prediction error for each q-matrix.
Table 11.3 lists the IDRs for all the possible concept states for the q-matrix given in Table
11.1. The all-zero concept state 000 describes the “default” knowledge not accounted for in
the model, while the all-one concept state 111 describes full understanding of all concepts.
Concept state 011’s IDR corresponds to a binary OR of the IDRs for concept states 001 and 010,
plus the addition of a 1 for q2, which requires both concepts c2 and c3 for a correct outcome.
To evaluate the fit of a given q-matrix to a data set, we compute its concept states and
IDRs as in Table 11.3 and determine each data point’s nearest neighbor from the set of
IDRs. The response is then assigned to the corresponding concept state, with an associ-
ated error, which is the L1 distance between the IDR and the response. In other words, the distance d(RESP, IDR) between a response vector RESP and its IDR, where k ranges over all questions, is

d(RESP, IDR) = ∑_k | RESP(k) − IDR(k) |
For example, for a response vector 0111110 and the q-matrix given in Table 11.1, the nearest
IDR would be 0111100 in concept state 011, and the error associated with this assignment
is 1, since there is only one difference between the response and its nearest IDR. The total
error for a q-matrix on a given data set is the sum of the errors over all data points.
TABLE 11.3
Ideal Response Vectors for Each Concept State
Concept State IDR Concept State IDR
000 0000100 100 0000110
001 0010100 101 0010110
010 0001100 110 0001111
011 0111100 111 1111111
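As a quick check of the assignment rule, the following snippet (ours) finds the nearest IDR in Table 11.3 for the response 0111110 used in the example above:

import numpy as np

idr_table = {'000': '0000100', '001': '0010100', '010': '0001100', '011': '0111100',
             '100': '0000110', '101': '0010110', '110': '0001111', '111': '1111111'}
resp = np.array(list('0111110'), dtype=int)
dists = {s: int(np.abs(resp - np.array(list(v), dtype=int)).sum())
         for s, v in idr_table.items()}
best = min(dists, key=dists.get)
print(best, dists[best])   # 011 1 -- concept state 011 at L1 distance 1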
TABLE 11.4
BRT-1 Expert and Extracted q-Matrices
Questions: q1.1 Cart. Prod.; q1.2 Cart. Prod.; q1.3 Relations; q1.4 Matrix Rep. of Relations; q1.5 Composites
Cartesian products:                    1 1 1 1 1
Relations:                             0 0 1 1 1
Composites, only concept extracted:    0 0 0 0 1
TABLE 11.6
BRT-3 Extracted q-Matrix, 4 Concepts, Err/Stud: 0.72
Questions: q3.1 Hasse Diagrams; q3.2 Maximal Elements; q3.3 Minimal Elements; q3.4 Upper Bounds; q3.5 Lower Bounds; q3.6 Least Upper Bounds; q3.7 Greatest Lower Bounds
that examined subsets of partially ordered sets for upper and lower bounds. The third
concept grouped together questions that combined the ideas of maximal and minimal ele-
ments with upper and lower bounds.
Of the 194 students who completed the BRT-3 quiz, only 78 had distinct response vectors.
The extracted 4-concept q-matrix is given in Table 11.6. When we compare the extracted
and expert q-matrices, we find that concept 3 is similar in both of these—in both, q3.6
and q3.7 are related, but in the extracted q-matrix, these are also related to q3.4. Question
q3.3 relates to no concepts, implying that most students got this (easy) question correct.
Questions q3.4 and q3.6 are related to 2 concepts each, implying that they are more difficult
(since questions that require more concepts are less likely to be correct). Concepts 1 and 4
both relate to one question each, suggesting that no other questions were missed when a
student missed one of these.
Using the q-matrix method, we would be very unlikely to extract the “Hasse diagrams”
concept from this tutorial, since every concept state where this concept was unknown
would have an IDR of all zeroes. However, the all-zero concept state would also have an
all-zero IDR if every question relates to at least one concept, and given equal-error choices,
we preferentially assign students to the all-zero state to simplify the model created.
11.4.4 How Many Concepts and How Much Data
A reasonable rule of thumb is to limit the number of concepts to the logarithm, base 2, of the number of response vectors in the data. Therefore, if you have 10 responses, there should be no more than 4 concepts.
We should also consider the number of questions in limiting the number of concepts.
The most fitted (and likely over-fitted) model would assign one concept to each ques-
tion, so there should be no more concepts than there are questions. Based on a prefer-
ence for a more concise concept matrix, we use the rule of thumb that the number of
concepts should be proportional to the logarithm, base 2, of the number of questions. We
can understand differences between the IDR and a student response in terms of guess and slip
rates, where a guess is when the IDR predicts an incorrect answer but the student guesses
the correct one, and a slip is when we predict a correct answer but observe an incorrect
one. To simplify our discussion, let’s assume the two rates are equal and are about 20%.
Therefore, the error associated with assigning any given response to a concept state has
an expected rate of 20% of answers being different than those in the IDR. If the number of
questions is 5, then the error rate would be 1 per student. With 10 questions, the error rate
would be 2 per student. If we expect high guess and slip rates, we would choose fewer
concepts to encompass more general groups of students. However, if we intend to tailor
responses more individually and expect low guess and slip rates, we would choose the
larger number of concepts.
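In code, these rules of thumb reduce to a couple of lines (the function names are ours):

import math

def max_concepts(num_response_vectors):
    # Limit concepts to log2 of the number of response vectors: 10 responses -> 4 concepts.
    return math.ceil(math.log2(num_response_vectors))

def expected_error_per_student(num_questions, guess_slip_rate=0.2):
    # Expected L1 error per student if ~20% of answers differ from the assigned IDR.
    return guess_slip_rate * num_questions

print(max_concepts(10), expected_error_per_student(5), expected_error_per_student(10))  # 4 1.0 2.0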
Several characteristics are assumed about data sets when using the q-matrix method.
First, the data set should be representative of a class of students, with many levels of abil-
ity and stages of learning present. Second, it is assumed that learning is not occurring
between questions. Third, concepts are assumed to be conjunctive, in other words, that all
skills needed for a problem must be present for the student to solve it. Fourth, it is assumed
that the questions in the quiz or test are good questions that test concepts of interest, and
that widely divergent material is not included on the same test (though analysis of the
extracted q-matrix can help identify cases where this is not true). In simulations where the method was applied to simulated student data sets with varied sizes and guess and slip rates, as
few as 25 unique responses were needed to recover the original q-matrix [11]. However, as
we have shown here in BRT-3, a pairing of complex material with little or no repetition of
the concepts being tested means that more data is needed to make an accurate or low-error
q-matrix.
11.6 Conclusions
This research represents an initial study of the effectiveness of data-derived q-matrices in
understanding and directing student learning. The method can be used with small data
sets of as few as 25 student responses to make usable q-matrices, though those derived
will work best with groups of students with similar performance and ability levels. Using
larger data sets with varied student profiles will make the derived q-matrices more general
but less tailored to a particular semester or instructor.
The questions used to collect the data are important; we assume that they are appropri-
ate, related, and written to test student knowledge of a domain. The q-matrix method can
highlight the need to adjust questions to make the test more balanced in terms of measur-
ing proficiency in several skills. Examining derived q-matrices can also help instructional
designers identify places where student behavior shows that there may be unintended
relationships among questions.
As we predicted, expert and extracted q-matrices did not often coincide. However, we
were able to use extracted q-matrices to understand patterns of student responses, such
as frequently correct or incorrect questions, and could compare the extracted concepts
with our understanding of the domain to determine whether the observed patterns made
sense. We also compared the questions that self-guided students chose to review with the
questions that we would have chosen for them based on our estimates of concept under-
standing. We found that the q-matrix method often chose the same questions for review
as the self-guided students chose for themselves. Based on survey results, a majority of
students felt as if the tutorial adapted to their individual knowledge.
Future work will address several important questions about the q-matrix method.
Although the method has been validated using simulated students, a comparison of
q-matrix results on varying class sizes would yield a measure of the robustness of the
method. We also plan to compare error for expert models in explaining student perfor-
mance with q-matrix models. It would also be interesting to use the q-matrix method to
compare skill matrices derived for the same class taught by different professors.
References
1. Ainsworth, S.E., Major, N., Grimshaw, S.K., Hayes, M., Underwood, J.D., Williams, B., and
Wood, D.J. 2003. REDEEM: Simple intelligent tutoring systems from usable tools, in T. Murray,
S. Blessing, and S.E. Ainsworth (eds), Advanced Tools for Advanced Technology Learning
Environments, pp. 205–232. Amsterdam, the Netherlands: Kluwer Academic Publishers.
2. Baffes, P. and Mooney, R.J. 1996. A novel application of theory refinement to student modeling,
in Proceedings of the 13th National Conference on Artificial Intelligence, Portland, OR, pp. 403–408.
3. Barnes, T. 2006. Evaluation of the q-matrix method in understanding student logic proofs, in
Proceedings of the 19th International Conference of the Florida Artificial Intelligence Research Society
(FLAIRS 2006), Melbourne Beach, FL, May 11–13, 2006.
4. Barnes, T. 2005. Experimental analysis of the q-matrix method in automated knowledge assess-
ment, in Proceedings IASTED International Conference on Computers and Advanced Technology in
Education (CATE 2005), Oranjestad, Aruba, August 29–31, 2005.
5. Barnes, T. 2005. The q-matrix method: Mining student response data for knowledge, in
Proceedings of the AAAI-2005 Workshop on Educational Data Mining, Pittsburgh, PA, July 9–13,
2005.
6. Barnes, T. and Bitzer, D. 2002. Fault tolerant teaching and automated knowledge assessment, in
Proceedings of the ACM Southeast Conference, Raleigh, NC, April, 2002.
7. Barnes, T., Bitzer, D., and Vouk, M. 2005. Experimental analysis of the q-matrix method in
knowledge discovery, in Proceedings of the International Symposium of Methodologies for Intelligent
Systems, Saratoga Springs, NY.
8. Barnes, T., Stamper, J., and Madhyastha, T. 2006. Comparative analysis of concept derivation
using the q-matrix method and facets, in Proceedings of the AAAI 21st National Conference on
Artificial Intelligence Educational Data Mining Workshop, Boston, MA, July 17, 2006.
9. Beck, J., Woolf, B.P., and Beal, C.R. 2000. ADVISOR: A machine learning architecture for intel-
ligent tutor construction, in 7th National Conference on Artificial Intelligence, pp. 552–557. Saint
Paul, MN: AAAI Press/The MIT Press.
10. Birenbaum, M., Kelly, A., and Tatsuoka, K. 1993. Diagnosing knowledge states in algebra using
the rule-space model. Journal for Research in Mathematics Education, 24(5): 442–459.
11. Brewer, P. 1996. Methods for concept mapping in computer based education. Computer Science
Masters thesis, North Carolina State University, Raleigh, NC.
12. Feng, M., Beck, J., Heffernan, N., and Koedinger, K. 2008. Can an intelligent tutoring system
predict math proficiency as well as a standardized test? in Proceedings of the 1st International
Conference on Educational Data Mining, Montreal, Canada, June 20–21, 2008, pp. 107–116.
13. Koedinger, K.R., Aleven, V., Heffernan, N., McLaren, B., and Hockenberry, M. 2004. Opening
the door to non-programmers: Authoring intelligent tutor behavior by demonstration, in
Proceedings of the 7th International Conference on Intelligent Tutoring Systems, Alagoas, Brazil,
pp. 162–173.
14. Murray, T. 1999. Authoring intelligent tutoring systems: An analysis of the state of the art.
International Journal of Artificial Intelligence in Education, 10: 98–129.
15. Razzaq, L., Heffernan, N., Feng, M., and Pardos, Z. 2007. Developing fine-grained transfer
models in the ASSISTment system. Journal of Technology, Instruction, Cognition, and Learning,
5(3): 289–304.
16. Sellers, J. 1998. An empirical evaluation of a fault-tolerant approach to computer-assisted teach-
ing of binary relations. Masters thesis, North Carolina State University, Raleigh, NC.
17. Tatsuoka, K. 1983. Rule space: An approach for dealing with misconceptions based on item
response theory. Journal of Educational Measurement, 20(4): 345–354.
12
Educational Data Mining to Support Group
Work in Software Development Projects
Contents
12.1 Introduction......................................................................................................................... 173
12.2 Theoretical Underpinning and Related Work................................................................ 174
12.3 Data....................................................................................................................................... 175
12.4 Data Mining Approaches and Results............................................................................. 176
12.4.1 Mirroring Visualizations....................................................................................... 176
12.4.2 Sequential Pattern Mining.................................................................................... 179
12.4.3 Clustering................................................................................................................. 180
12.4.3.1 Clustering Groups.................................................................................... 180
12.4.3.2 Clustering Students................................................................................. 181
12.4.4 Limitations............................................................................................................... 183
12.5 Conclusions.......................................................................................................................... 183
References...................................................................................................................................... 184
12.1 Introduction
Educational data mining is the process of converting data from educational systems to use-
ful information to inform pedagogical design decisions and answer educational questions.
With the increasing use of technology in education, there is a large amount of content and
electronic traces generated by student interaction with computer learning environments
such as collaborative learning systems, project management systems, intelligent teaching
systems, simulators, and educational game environments. The challenge is how to effec-
tively mine these large amounts of data, find meaningful patterns, and present them to
teachers and students in a useful form.
In this chapter, we focus on supporting the teaching and learning of group work
skills. Group work is central in many aspects of life. It is especially important in the
workplace where most often the combined efforts of a group of people are required to
complete a complex task in a given time. The context of our study is a senior software
development project, where students work in groups to develop a software solution
for a real industry client over a semester. Students use a state-of-the-art online group collaboration environment with tools such as a wiki for sharing web pages, a ticketing
tool for coordination and planning, and a software repository with version control. Our goal
was threefold:
To achieve these goals, we applied three approaches: (1) mirroring visualizations to sum-
marize the huge amounts of longitudinal data and give students and teachers a bird’s-eye
view of the activity of the group; (2) sequential pattern mining to identify interactions
between team members and action sequences, characterizing strong and weak groups;
and (3) clustering of the students and groups according to their activity to find nontempo-
ral patterns characterizing student and group behavior.
The next section describes the theoretical underpinning of our work and the related
work. Then, we present the context of our study and the data used. The following sections
explain the application of the three data mining techniques, highlight the important find-
ings, and discuss how the results can be used to improve group work skills.
operate); mutual trust; and closed-loop communication (e.g., a person posting a message on any medium receives feedback about it and confirms this). This theory provides a language with which to discuss group work and also guides our data mining.
The second theoretical underpinning is based on Extreme Programming [2], a success-
ful software development methodology. Some of its key principles are: constant commu-
nication within the group and with customers, simple clean design, pair programming,
testing the software from day one, delivering the code to customers as early as possi-
ble, and implementing changes as suggested. These principles helped us to identify the
features of activity that should be extracted, used in the analysis, and presented in the
visualizations.
Some researchers have investigated the use of data mining to analyze collaborative
interactions. Talavera and Gaudioso [3] applied clustering to student interaction data to
build profiles of student behaviors in a course teaching the use of the Internet. Data was col-
lected from forums, email, and chat. The goal was to support evaluation of collaborative
activities and, although only preliminary results were presented, their work confirmed the
potential of data mining to extract useful patterns and get insight into collaboration pro-
files. In [4] a method based on clustering and statistical indicators was proposed. The aim
was to infer information about the collaboration process in an online collaborative learn-
ing environment. Prata et al. [5] have developed a machine-learned model, which auto-
matically detects students’ speech acts in a collaborative learning environment, and found
a positive correlation between speech acts denoting interpersonal conflict and learning
gains. Soller [6] analyzed knowledge sharing conversations using Hidden Markov models
and multidimensional scaling. However, her approach required group members to use a
special interface using sentence starters. The DEGREE system [7] allows students to sub-
mit text proposals, coedit, and refine them, until agreement is reached. However, it also
requires a special interface and user-classified utterances and is also limited to a single col-
laboration medium. By contrast, we wanted to ensure that the learners used collections of
conventional collaboration tools in an authentic manner, as they are intended to be used to
support group work; we did not want to add interface restrictions or additional activities
for learners as a support for the data mining. These goals ensure the potential generality
of the tools we want to create. It also means that we can explore the use of a range of col-
laboration tools, not just a single medium, such as chat.
Mirroring visualizations to support group collaboration have been used previously [8,9].
In [10], the goal is to improve participation rate by creating an adaptive rewards system
using mirrored group and individual models. However, it significantly differs from our
goal of supporting small groups for which learning group work skills is one of the learn-
ing objectives and the group work is the key focus. In our previous work, we found that
mirroring of simple overall information about a group is valuable [11]. The work on social
translucence [12,13] has also shown the value of mirroring in helping group members to
realize how their attitudes affect the group and to alter their behavior.
12.3 Data
The context of this study is a senior level capstone software development project course,
which runs over a semester. There were 43 students, working in 7 groups of 5–7. The
task was to develop a software solution for a client. The topics varied from creating a
computer-based driving ability test to developing an object tracking system for an art
installation.
The groups were required to use Trac [14], an open-source, web-based project management system for professional software development. It consists of three parts, tightly integrated through hyperlinks: a wiki for creating and sharing web pages, a ticketing system for task planning and coordination, and a browser for the group's software repository, which is kept under version control (SVN).
The Trac usage data contained roughly 15,000 events and its size was 1.6 MB in mySQL
format. We also had the progressive and final student marks, which take into account both
the quality of the product and the group management process followed. Based on them,
the groups were ranked and named accordingly—Group 1 is the best group and Group 7
is the weakest.
work and the key elements of group management from Extreme Programming pointed
to the key elements that teachers and students should be able to learn from the mirror
visualizations.
We concluded that we should (1) summarize each student’s activity (daily and cumula-
tively) on each of the media (wiki, task planning tickets, and version control actions in the
software repository) and (2) summarize the flow of interaction between the team mem-
bers. We expected that some group members would have different patterns of activity
from others, e.g., the group leaders should have appeared more active on planning tasks,
reflected in creation of tickets; the absence of this pattern would suggest that the leaders
are not doing their job. We also expected that better groups would show a higher interac-
tivity than the weaker ones.
We designed a set of visualizations to meet these goals. We describe two of them. Figure
12.1 shows our main visualization, the Wattle Tree (named after an Australian native plant
with fluffy golden-yellow round flowers). Each person in the team appears as a “tree”
that climbs up the page over time. The tree starts when the user first does an action on
any of the three media considered. The vertical axis shows the date and the day number.
Wiki-related activity is represented by yellow “flowers,” circles on the left of the trees.
SVN-related activity is similarly represented, as orange flowers on the right of the trees.
The size of the flower indicates the size of the contribution. Ticket actions are represented
by leaves—the green lines: a dark green leaf on the left indicates a ticket was opened by the
user and a light green leaf on the right indicates the user closed a ticket. The length of the
left leaf is proportional to the time it remained opened. Those still open are shown at a
standard, maximal size (e.g., the ones around day 41 in Figure 12.1).
As already discussed, a good team leader will typically take the responsibility for open-
ing most of the tickets. We see that the leftmost person in Figure 12.1 has opened many
tickets while the closing of tickets is more evenly distributed across the team. Although
these visualizations are intended to be meaningful for the team, rather than the outsider
viewing them, there are some features we can identify in Figure 12.1. The two students at
the left appear to have been the most active in management aspects at the wiki and tickets:
knowledge of the group bears this out. The fourth student from the left is particularly
active on SVN corresponding to a larger role in the technical development. The fifth stu-
dent from the left has a hiatus from about Day 27, corresponding to too little activity. This
group had times when several members were ill or had other difficulties and they could
see the effect of these problems in the Wattle diagram. The team member with respon-
sibility for tracking progress could use the Wattle tree to get an overview of activity at a
glance. This serves as a starting point for delving into the details, as needed, by checking
individual tickets, wiki pages and SVN documents and their histories.
Wattle trees do not contain information on who issued tickets to whom, and who con-
tributes to a wiki page. In order to visualize these activities, we use what we call an inter-
action network, inspired by the graphical notations used in Social Network Analysis
[15], and showing relationships and flows between entities. An example for one group is
shown in Figure 12.2. The nodes represent team members and the lines between them
indicate interaction between the team members, as they modify the same wiki page,
SVN file or ticket. The width of the edge is proportional to the number of interactions
between them.
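A minimal sketch of such a network, assuming pairwise interaction counts have already been extracted from the wiki, SVN, and ticket logs, can be drawn with the networkx library; the counts below are made-up example data, not figures from our study.

    # Illustrative sketch of an interaction network: nodes are team members and edge
    # width is proportional to the number of shared-artifact interactions.
    # The pairwise counts below are hypothetical example data.
    import matplotlib.pyplot as plt
    import networkx as nx

    interactions = {("s1", "s2"): 12, ("s1", "s4"): 3, ("s2", "s3"): 7, ("s3", "s4"): 1}

    G = nx.Graph()
    for (a, b), count in interactions.items():
        G.add_edge(a, b, weight=count)

    pos = nx.spring_layout(G, seed=42)
    widths = [G[u][v]["weight"] / 2.0 for u, v in G.edges()]
    nx.draw(G, pos, with_labels=True, width=widths, node_color="lightblue")
    plt.show()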
These interaction graphs can be used for reflective group activities in class. They may indicate different patterns and require deep knowledge for interpretation. For example, a thick line may correspond to a group member writing poor code that the others have to fix, or to a group member who does not trust the others and frequently edits their code.
FIGURE 12.1
Wattle diagram (trees, left to right: s1, s2, s3, s4, s5, Tutor, s6, s7).
The Wattle tree and interaction networks were available during the semester and were
used in the weekly meetings between student groups and teachers. Individual students
found that they could see what they and the other students in the group had been doing and whether they had been performing at appropriate levels. Teachers found that they
could identify group problems very early and help address them.
The visualization tools that we developed were also used successfully in the context of
postgraduate courses in educational technology, where groups worked collaboratively to
write essays. Thus, they have a broader value.
FIGURE 12.2
Interaction graph (medium: wiki).

Future work will include making the Wattle trees and interaction networks more interactive. We have recently developed an interactive visualization, called Narcissus [16]. It supports views at three levels: group, project, and ticket, and the user can also click on the components and see the supporting evidence.
1. Abstraction of the raw data into a list of events for each group (a unique, chrono-
logical sequence of events). The resulting sequence for each group consisted of
1416–3395 events.
2. Splitting these long sequences into a dataset of several meaningful sequences. We formed three types of sequences: (1) for the same resource (e.g., the same wiki page or ticket); (2) for a group session (a session is formed by cutting the group event sequence wherever a gap longer than a minimum length of time occurred; see the sketch after this list); and (3) for the same task (a task is defined by a ticket and comprises all events linked from and to that ticket).
3. Encoding the events suitably to facilitate data mining. We developed and used
several alphabets to describe our events in a way that could compress our data
meaningfully and we ran the GSP algorithm using all these alphabets.
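The preprocessing in steps 2 and 3 can be sketched as follows; the 60-minute gap threshold, the field names, and the single-letter alphabet are illustrative assumptions rather than the exact settings we used.

    # Minimal sketch of splitting a group's chronological event list into sessions at
    # large time gaps (step 2) and encoding each event with a simple alphabet (step 3).
    # The gap threshold, field names, and alphabet are illustrative assumptions.
    from datetime import timedelta

    def split_sessions(events, gap=timedelta(minutes=60)):
        """events: list of dicts with 'time' (datetime) and 'medium' keys, sorted by time."""
        sessions, current = [], []
        for i, ev in enumerate(events):
            if current and ev["time"] - events[i - 1]["time"] > gap:
                sessions.append(current)
                current = []
            current.append(ev)
        if current:
            sessions.append(current)
        return sessions

    def encode(session, alphabet={"wiki": "W", "ticket": "T", "svn": "S"}):
        """Compress a session into a string suitable for sequence mining (e.g., GSP)."""
        return "".join(alphabet[ev["medium"]] for ev in session)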
Table 12.1 summarizes some of the distinctive patterns we found for the strong and weak
groups and their explanation.
After the extraction of these patterns, the teachers examined in detail the actions on
Trac and confirmed the meaningfulness of the results. The results can be used in future
teaching to provide concrete examples of patterns associated with good and poor practice,
e.g., to illustrate general principles such as effective leadership and monitoring in terms
TABLE 12.1
Sequential Patterns for Best and Weak Groups

Pattern: Frequent alternation of SVN and ticket events.
Explanation: Tickets being updated following SVN commits.

Pattern: Tickets used more often than wiki, and many consecutive ticket events.
Explanation: Higher and better use of the ticketing system; tickets are more task-oriented than the wiki, and the ticketing system is a better indicator of the work actually done.

Pattern: Many consecutive SVN events.
Explanation: Work more often committed to the group repository.

Pattern: Many SVN events on the same files.
Explanation: Regular commits to the repository (software development done).

Pattern: Wiki used more often than tickets.
Explanation: Wikis are less focused and task-oriented than tickets.

Pattern: Leaders involved in too much technical work; tickets created by the Leader, followed by ticketing events by other members (e.g., open, edit, close), before completing work.
Explanation: Less effective leadership.

Pattern: Tracker rather than Leader creating and editing many tickets.
Explanation: Tracker performing leadership duties.
of wikis, tickets, and SVN activity. The patterns can also be extracted and presented to
students during the semester as formative feedback to help rectify poor group operation
and encourage effective practices.
12.4.3 Clustering
We applied clustering to find patterns characterizing student behavior at two levels: groups of students and individual students. Clustering is an unsupervised method for finding groups of similar objects using multiple data attributes. We used the classical k-means clustering algorithm as it is simple, effective, and relatively efficient [21].
TABLE 12.2
Clustering Groups of Students (Distinguishing Characteristics)

Cluster 1: Group 1
  Tickets: Very frequent activity; high number of events per ticket
  Wiki: High edits per wiki page; high wiki page usage span
  SVN: Very frequent activity

Cluster 2: Groups 5 and 7
  Tickets: Moderately frequent activity; low number of events per ticket and low % of ticket update events
  Wiki: High edits per wiki page; low number of lines added/deleted per wiki edit
  SVN: Moderately frequent activity
The attributes used for clustering the groups included the number of days ticketing occurred, the number of events per wiki page, wiki page usage span, the number of lines edited/deleted per wiki edit, the number of days SVN activity occurred, and others.
The number of clusters was set to 3, based on our knowledge about the groups, e.g., their
progressive and final marks, which reflected the quality of the processes followed and the
quality of the final product. The resulting clusters for k-means and their distinguishing
characteristics are shown in Table 12.2. It should be noted that we also experimented with
hierarchical clustering and EM for k = 3 and obtained the same clusters.
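A sketch of this group-level clustering, assuming the per-group attributes have been assembled into a table (the file name and column layout below are hypothetical), is:

    # A minimal sketch of the group-level clustering with k-means and k = 3, using
    # scikit-learn. The input file and its columns are hypothetical; the exact attribute
    # set and preprocessing used in our study may differ.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    features = pd.read_csv("group_features.csv", index_col="group")  # one row per group
    X = StandardScaler().fit_transform(features)     # put the attributes on a common scale
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(dict(zip(features.index, labels)))         # group -> cluster assignment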
The first cluster contains only Group 1 (the top group) and is characterized by active use of the three media and a high number of active events such as ticket acceptance and update, wiki page edits, and SVN activities. The second cluster contains Groups 5 and 7 and is characterized by moderately frequent use of tickets and SVN and a low number of active events. Although there were many edits of wiki pages, as in cluster 1, they involved small modifications (lines added/deleted). The third cluster contains the remaining groups and is characterized by overall low ticketing and SVN activity. While low ticketing activity is typically associated with weaker groups, Group 2, the second best group, also showed this characteristic, as it was reluctant to use the ticketing system.
These results were useful for the teachers. For example, the teachers had some sense that Group 1 was well managed, but this cluster analysis pointed to some interesting behaviors distinguishing this group that we had not noticed before. For example, we found that the group made extensive use of the wiki on each ticket for communication about the task associated with it, which is a novel and effective way to use Trac. This new understanding was used in subsequent teaching and was evaluated by the teachers as helpful.
TABLE 12.3
Clustering Individual Students (Distinguishing Characteristics)

Managers (8 students)
  Tickets: High ticketing activity; involved in many tickets
  Wiki: High wiki activity
  SVN: Moderate SVN activity

Trac-Oriented Developers (9 students)
  Tickets: Moderately high ticketing activity; ticketing occurring on many different days
  Wiki: Moderate wiki activity
  SVN: Very high SVN activity

Loafers (11 students)
  Tickets: Low ticketing activity
  Wiki: Low wiki activity
  SVN: Low SVN activity

Others (15 students)
  Tickets: Moderately low ticketing activity
  Wiki: Moderately low wiki activity; many wiki events on the days on which wiki events occurred
  SVN: Many SVN events on the days on which SVN events occurred
Table 12.3 shows the resulting clusters of individual students with their characteristics. Based on our interpretation of the cluster characteristics, a cluster label was assigned to each cluster (“Managers,” “Trac-oriented developers,” “Loafers,” and “Others”).
We then looked at the group composition, taking into account the labels assigned to each
student; see Table 12.4. The stars (*) show where the group’s designated manager (leader)
was placed by the clustering. This role was allocated to one person after the initial start-
up period. For example, Group 7 consisted of 7 students; 1 of them was clustered as
“Manager,” 0 as “Trac-oriented developer,” 2 as “Loafers” and 4 as “Others”; the desig-
nated manager was clustered as “Manager.”
Some new interesting differences between the groups emerged. Groups 2 and 3 (placed
in the same cluster, see Table 12.2) differ by Group 3’s lack of a manager and Group 2’s
lack of Trac-oriented developers. The first finding was consistent with our knowledge of
the leadership problems this group encountered, with the original manager leaving the
course and another student taking over. The second finding was validated in a group
interview, where the main developers expressed a reluctance to use Trac. Group 5 is also
TABLE 12.4
Group Composition

          Managers   Trac-Oriented Developers   Loafers   Others
Group 1   *1         3                          1         1
Group 2   *1         0                          1         3
Group 3   0          1                          2         *3
Group 4   *1         3                          2         0
Group 5   3          *1                         0         3
Group 6   *1         1                          3         1
Group 7   *1         0                          2         4
distinctive in its excess of managers, which was further complicated by their designated
manager being placed in the “Trac-oriented developers” cluster. It appears that this weak
leadership resulted in others reacting to fill the manager’s role, with their technical work
subsequently being compromised, which is a pattern to be aware of in future groups.
To verify whether these patterns were evident earlier, we ran the clustering using only the data from the first seven weeks. We found that some of these key results had already emerged. For example, the Group 5 leader was already showing developer behaviors. Had the teachers been aware of this, they might have been able to help this group deal with the problem early enough to make a difference. The presence of three loafers was also apparent in Group 6. The early data also showed leadership behaviors by all the other leaders at that stage. These results have great value not only for the teachers but also for the individual students, who can use them to understand their behavior and change it if needed.
In conclusion, we found clustering to be useful, revealing interesting patterns characterizing the behavior of the groups and individual students. Strong groups are characterized by effective group leadership and frequent use of the three media, with a high number of active events. Some important results are evident in early data, enabling timely problem identification and intervention.
12.4.4 Limitations
The data have several limitations that affected the results.
First, the Trac data do not capture all group communication. In addition to Trac, students
collaborated and communicated via other media to which we do not have access, such as
face-to-face meetings twice a week, instant messaging, telephone conversations, SMS, and
others. Adding a chat tool to Trac could capture some of this communication and also increase the use of Trac.
Second, the data mining results revealed insufficient use of Trac by some of the groups
(e.g., Group 2) and subsequent interviews revealed that they were reluctant to use Trac as
they felt it was cumbersome and preferred to communicate by other means.
Third, there were not enough instances for the cluster analysis, especially for the clustering of groups. Although our data contained more than 15,000 events, there were only 7 groups and 43 students to cluster. Nevertheless, we think that clustering on the collected data and selected attributes uncovered useful patterns characterizing the work of stronger and weaker students, as already discussed. The follow-up interviews were very helpful for interpreting and validating the patterns.
12.5 Conclusions
This chapter describes our work on mining student group interaction data. Our goal was to improve teaching and learning of group work skills in the context of a capstone software development project course. Students used Trac, a state-of-the-art collaborative platform for software development, which includes a wiki for shared web pages, a ticket-based task management system, and a software repository. We applied three data mining approaches: mirroring visualization, sequential pattern mining, and clustering. We found that they enabled us to achieve the goals of early identification of problems in groups, improving monitoring and self-monitoring of student progress throughout the semester, and helping students learn to work effectively in groups.
More specifically, mirroring visualizations such as Wattle trees and interaction networks helped us to understand how well individual members were contributing to the group and how these visualizations can be used for reflective activities in class. The clustering and sequential mining results revealed interesting patterns distinguishing between the strong and weak groups, and we also found that some of them can be mined from early data, in time
to rectify problems. The next step will be to automate the discovery of patterns during the
semester, match them with the patterns we validated, and present regular formative feedback
to students, including useful links and remedial exercises. Future work will also include sys-
tematic evaluation of the impact of the results on student learning in future cohorts.
Our work also highlighted some of the challenges in analyzing and visualizing educational data. Educational data are temporal and noisy, and there may not be enough data for some tasks. For example, one of the groups was reluctant to use the ticket system, which resulted in non-representative ticket data; there were not enough samples for the clustering of the groups (i.e., many events but a small number of students and groups). A good understanding of the educational domain and data, and suitable preprocessing for data abstraction, representation, and feature selection, are critical for success.
Our approach and results may be valuable not only for software development projects using online collaboration systems similar to Trac, but also for the much broader area of Computer Supported Collaborative Learning, where teachers need to address many of the same concerns that were drivers for this work, e.g., identifying groups that are functioning poorly and individuals who are not contributing, or who are contributing in ways that do not match their assigned group roles and responsibilities.
References
1. Salas, E., D.E. Sims, and C.S. Burke, Is there a “Big Five” in teamwork? Small Group Research 36:
555–599, 2005.
2. XP—Extreme Programming. [cited 2009]; Available from: www.extremeprogramming.org
3. Talavera, L. and E. Gaudioso, Mining student data to characterize similar behavior groups in
unstructured collaboration spaces, in Proceedings of European Conference on Artificial Intelligence,
Valencia, Spain, 2004, pp. 17–23.
4. Anaya, A.R. and J.G. Boticario, A data mining approach to reveal representative collabora-
tion indicators in open collaboration frameworks, in Proceedings of Educational Data Mining
Conference, Cordoba, Spain, 2009, pp. 210–219.
5. Prata, D.N., R.S.d. Baker, E.d.B. Costa, C.P. Rose, Y. Cui, and A.M.J.B.d. Carvalho, Detecting
and understanding the impact of cognitive and interpersonal conflict in computer supported
collaborative environments, in Proceedings of 2nd International Conference on Educational Data
Mining, T. Barnes, M. Desmarais, C. Romero, and S. Ventura, Editors, Cordoba, Spain, 2009,
pp. 131–140.
6. Soller, A., Computational modeling and analysis of knowledge sharing in collaborative dis-
tance learning. User Modeling and User-Adapted Interaction 14(4): 351–381, 2004.
7. Barros, B. and M.F. Verdejo, Analysing student interaction processes in order to improve col-
laboration. The DEGREE approach. International Journal of Artificial Intelligence in Education 11:
221–241, 2000.
8. Jermann, P., A. Soller, and M. Muehlenbrock, From mirroring to guiding: A review of state of the
art technology for supporting collaborative learning, in Proceedings of 1st European Conference on
Computer-Supported Collaborative Learning, Maastricht, the Netherlands, 2001, pp. 324–331.
13
Multi-Instance Learning versus Single-Instance Learning

Contents
13.1 Introduction
13.2 Multi-Instance Learning
13.2.1 Definition and Notation of Multi-Instance Learning
13.2.2 Literature Review of Multi-Instance Learning
13.3 Problem of Predicting Students’ Results Based on Their Virtual Learning Platform Performance
13.3.1 Components of the Moodle Virtual Learning Platform
13.3.2 Representation of Information for Working with Machine Learning Algorithms
13.4 Experimentation and Results
13.4.1 Problem Domain Used in Experimentation
13.4.2 Comparison with Supervised Learning Algorithms
13.4.3 Comparison with Multi-Instance Learning
13.4.4 Comparison between Single- and Multi-Instance Learning
13.5 Conclusions and Future Work
References
13.1 Introduction
Advances in technology and the impact of the Internet in the last few years have affected all aspects of our lives. In particular, the implications for education are of incalculable magnitude, making the relationship between technology and education more and more obvious and necessary. In this respect, it is important to mention the appearance of virtual learning environments (VLEs), or e-learning platforms [1]. These systems can potentially eliminate barriers and provide flexibility, constantly updated material, student memory retention, individualized learning, and feedback superior to the traditional classroom [2]. The design and implementation of e-learning platforms have grown exponentially in the last few years, becoming an essential tool to support both face-to-face classrooms and distance learning. The use of these applications accumulates a great amount of information, because they can record all students' actions and interactions in log files and data sets.
Ever since this problem was identified, there has been a growing interest in analyzing this valuable information to detect possible errors, shortcomings, and improvements in student performance, and to discover how students' motivation affects the way they interact with the software. Promoted by this appreciable interest, a considerable number of automatic tools that make it possible to work with vast quantities of data have appeared. Fausett and Elwasif [3] predicted students' grades (classified into five classes: A, B, C, D, and E or F) from test scores using neural networks; Martínez [4] predicted student academic success (successful or not) using discriminant function analysis; Minaei-Bidgoli and Punch [5] classified students by using genetic algorithms to predict their final grade; Superby et al. [6] predicted students' academic success (classified into low-, medium-, and high-risk classes) using different data mining methods; Kotsiantis and Pintelas [7] predicted students' marks (pass and fail classes) using regression techniques on Hellenic Open University data; and Delgado et al. [8] used neural network models on Moodle logs. Further information can be found in a very complete survey by Romero and Ventura [9]; it provides a good review of the main works (from 1995 to 2005) that use data mining techniques in e-learning environments, grouped by task.
The main property shared by all the studies addressing this problem to date is the use of a traditional learning perspective with a single-instance representation. However, the essential particularity of this problem is that the information is incomplete, because each course has different types and numbers of activities, and each student carries out the activities he or she considers most interesting, dedicating more or less time to them. Multi-instance learning (MIL) allows a more appropriate representation of this information by storing the general information of each pattern by means of bag attributes and the specific information about the student's work by means of a variable number of instances. This chapter presents both the traditional supervised learning representation and the first proposal to work on this problem in a MIL scenario. Algorithms from the most representative paradigms in both traditional supervised learning and MIL are compared. Experimental results show that MIL is more effective, obtaining more accurate models as well as a more optimized representation.
The chapter is organized as follows. Section 13.2 introduces MIL and its main definitions and techniques. Section 13.3 presents the problem of classifying students' performance from single-instance and multi-instance perspectives. Section 13.4 reports experimental results for all the algorithms tested and compares them. Finally, Section 13.5 summarizes the main contributions of this chapter and raises some future research directions.
approaches differ in the available labels from which they learn. In a traditional supervised learning setting, an object m_i is represented by a feature vector v_i, which is associated with a label f(m_i). In the multi-instance setting, however, each object m_i may have V_i variant instances, denoted m_i,1, m_i,2, …, m_i,Vi. Each of these variants is represented by a (usually) distinct feature vector V(m_i,j). A complete training example is therefore written as (V(m_i,1), V(m_i,2), …, V(m_i,Vi), f(m_i)).
The goal of learning is to find a good approximation f′(m_i) of the function f(m_i) by analyzing a set of training examples labeled by f(m_i). To obtain this function, Dietterich defines a hypothesis that assumes that if the observed result is positive, then at least one of the variant instances must have produced that positive result; furthermore, if the observed result is negative, then none of the variant instances could have produced a positive result. This can be modeled by introducing a second function g(V(m_i,j)) that takes a single variant instance and produces a result. The externally observed result, f(m_i), can then be defined as follows:

    f(m_i) = 1 if g(V(m_i,j)) = 1 for at least one j, and f(m_i) = 0 otherwise.
In the early years of research on MIL, all multi-instance classification works were based on this assumption, which is known as the standard multi-instance or Dietterich hypothesis. More recently, generalized MIL models have been formalized [11,12], in which a bag qualifies as positive if its instances satisfy some more sophisticated constraint than simply having at least one positive instance.
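The standard assumption can be written compactly in code. The sketch below is only an illustration; `instance_classifier` stands for the per-instance function g(V(m_i,j)) and is assumed to be given.

    # A small sketch of the standard (Dietterich) multi-instance assumption: a bag is
    # labeled positive if at least one of its instances is classified as positive.
    from typing import Callable, Sequence

    def bag_label(bag: Sequence, instance_classifier: Callable[[object], bool]) -> bool:
        """Return the bag-level label f(m_i) under the standard MI assumption."""
        return any(instance_classifier(instance) for instance in bag)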
TABLE 13.1
Information Summary Considered in Our Study

Assignment
  numberAssignment: Number of practices/tasks done by the user in the course
  timeAssignment: Total time in seconds that the user has been in the assignment section
Forum
  numberPosts: Number of messages sent by the user to the forum
  numberRead: Number of messages read by the user in the forum
  timeForum: Total time in seconds that the user has been in the forum section
Quiz
  numberQuiz: Number of quizzes seen by the user
  numberQuiz_a: Number of quizzes passed by the user
  numberQuiz_s: Number of quizzes failed by the user
  timeQuiz: Total time in seconds that the user has been in the quiz section
A summary of the information considered for each activity in our study is shown in
Table 13.1.
FIGURE 13.1
Information about bags and information about instances.
The possible activity types include QUIZ_P, the number of quizzes passed by the student; QUIZ_F, the number of quizzes failed by the student; QUIZ, referring to the times the student has visited a quiz without actually answering it; FORUM_POST, the number of messages that the student has submitted;
FORUM_READ, number of messages that the student has read; and FORUM, referring to
the times the student has seen different forums without entering them. In addition, the
bag contains three attributes: student identification, course identification, and the final
mark obtained by the student in that course.
This information could be represented in a natural way from the MIL perspective. It is
a flexible representation where new activities can be added without affecting the patterns
that do not consider this new type of activity. The information about the types of activi-
ties carried out is stored as instances, and the number of instances per student is variable.
Thus, activities that are not very common in the courses could be studied without increas-
ing the general information about each pattern.
Figure 13.2 shows available information about two students. Figure 13.2a shows the
information according to traditional supervised learning; each student is a pattern that
contains all the information considered, even though this student may not have actually
done any type of activity. Figure 13.2b and c show the information according to the MIL representation. Figure 13.2b shows a user who has carried out only one activity, and we can see the information that belongs to the bag or to the instance in each case. Figure 13.2c shows a user
that has carried out all the activities along with his or her information with respect to the
bag and instances.
FIGURE 13.2
Information about two students: (a) available information for students 1 and 2 under traditional supervised learning; (b) bag and instance information for student 1 under MIL; (c) bag and instance information for student 2 under MIL.
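As an illustration of this representation (and not of any particular MIL library), the bag and instance structures of Figure 13.2 might be modeled as follows; the attribute values shown are hypothetical.

    # Illustrative sketch of the bag/instance representation: each student-course pair is
    # a bag with general attributes plus one instance per type of activity actually used.
    # The field names mirror the attributes described above; the values are hypothetical.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Instance:
        type_activity: str          # e.g., "QUIZ_P", "FORUM_READ", "ASSIGNMENT"
        number_of_activities: int
        time_of_activities: int     # total time in seconds

    @dataclass
    class Bag:
        user_id: int
        course: int
        mark_final: str             # class label: "pass" or "fail"
        instances: List[Instance] = field(default_factory=list)

    # A student who only viewed quizzes gets a single instance; a more active student
    # simply gets more instances, without changing the bag-level attributes.
    student = Bag(user_id=2, course=1, mark_final="pass", instances=[
        Instance("ASSIGNMENT", 3, 520),
        Instance("QUIZ_P", 5, 1980),
        Instance("FORUM_READ", 20, 630),
    ])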
for solving this problem. Finally, the single-instance and multi-instance proposals are compared. All experiments are carried out using 10-fold stratified cross-validation [30], and the average values of accuracy, sensitivity, and specificity are reported below. These measures are widely used in classification and evaluate different aspects of a classifier; therefore, an acceptable classifier must obtain an appropriate value for each one of them. Accuracy is the proportion of cases correctly identified, sensitivity is the proportion of cases correctly identified as meeting a certain condition, and specificity is the proportion correctly identified as not meeting a certain condition.
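For reference, the three measures can be computed from a binary confusion matrix as in the sketch below; this follows the textbook definitions rather than any specific implementation used in the experiments.

    # Accuracy, sensitivity, and specificity from the counts of a binary confusion matrix.
    def metrics(tp: int, tn: int, fp: int, fn: int):
        accuracy = (tp + tn) / (tp + tn + fp + fn)  # proportion of cases correctly identified
        sensitivity = tp / (tp + fn)                # correct among cases meeting the condition
        specificity = tn / (tn + fp)                # correct among cases not meeting the condition
        return accuracy, sensitivity, specificity

    print(metrics(tp=50, tn=30, fp=10, fn=15))      # example counts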
In this section, first the problem domain is described briefly and then the results are
shown and discussed.
TABLE 13.2
General Information about the Courses

Course Identifier   Number of Students   Number of Assignments   Number of Forums   Number of Quizzes
ICT-29              118                  11                      2                  0
ICT-46              9                    0                       3                  6
ICT-88              72                   12                      2                  0
ICT-94              66                   2                       3                  31
ICT-110             62                   7                       9                  12
ICT-111             13                   19                      4                  0
ICT-218             79                   4                       5                  30
TABLE 13.3
Results for Supervised Learning Algorithms
Algorithms Accuracy Sensitivity Specificity
Trees DecisionStump 0.6690 0.8889 0.3651
RandomForest 0.6667 0.7573 0.5426
RandomTree 0.6476 0.6996 0.5755
J48 graft 0.6881 0.7950 0.5408
J48 0.6857 0.7950 0.5345
Rules NNge 0.6952 0.7329 0.6434
Ridor 0.6810 0.8648 0.4310
OneR 0.6476 0.7665 0.4835
ZeroR 0.5810 1.0000 0.0000
Naive Bayes NaiveBayes 0.6857 0.8232 0.4944
NaiveBayesMultinomial 0.6929 0.7662 0.5918
NaiveBayesSimple 0.6810 0.8232 0.4832
NaiveBayesUpdateable 0.6857 0.8232 0.4944
Others RBFNetwork 0.6929 0.8227 0.5114
SMO 0.6976 0.8842 0.4374
FIGURE 13.3
Accuracy obtained by single-instance algorithms.
TABLE 13.4
Results for Multi-Instance Learning Algorithms

  Algorithm          Accuracy  Sensitivity  Specificity
Methods based on supervised learning (Simple)
  PART               0.7357    0.8387       0.5920
  AdaBoostM1&PART    0.7262    0.8187       0.5992
Methods based on supervised learning (Wrapper)
  Bagging&PART       0.7167    0.7733       0.6361
  AdaBoostM1&PART    0.7071    0.7735       0.6136
  PART               0.7024    0.7857       0.5842
  SMO                0.6810    0.8644       0.4270
  NaiveBayes         0.6786    0.8515       0.4371
Methods based on distance
  MIOptimalBall      0.7071    0.7218       0.6877
  CitationKNN        0.7000    0.7977       0.5631
Methods based on trees
  DecisionStump      0.6762    0.7820       0.5277
  RepTree            0.6595    0.7127       0.5866
Logistic regression
  MILR               0.6952    0.8183       0.5218
Methods based on diverse density
  MIDD               0.6976    0.8552       0.4783
  MIEMDD             0.6762    0.8549       0.4250
  MDD                0.6571    0.7864       0.4757
optimize by using multi-objective strategies. With respect to the different paradigms used, the methods based on rules (PART), or combinations of this method with other proposals, obtain the best results for this type of learning. The accuracy obtained by each algorithm is shown in Figure 13.4, where the algorithms are sorted by accuracy. This makes the differences between algorithms easier to see, and we can confirm that rule-based methods such as PART achieve the best results. Moreover, these methods add comprehensibility to the problem by generating rules that a user can interpret easily.
FIGURE 13.4
Accuracy obtained by multi-instance algorithms.
It can be seen that some of the paradigms used in the single-instance and multi-instance representations are similar. However, these methods are not directly comparable because they are not exactly the same implementations. Therefore, it is necessary to evaluate the results obtained by each representation, considering all the algorithms evaluated, in order to draw a final conclusion (in Section 13.4.4 we perform this comparison). At first sight, it is interesting to see in the figures that the algorithms with the multi-instance representation generally yield higher accuracy values.
TABLE 13.5
Sum of Ranks and Mean Rank of the Two Representations

                                  Mean Rank   Sum of Ranks
Multi-instance representation     18.67       280
Single-instance representation    12.33       185
TABLE 13.6
Wilcoxon Rank–Sum Test Results

                                     Wilcoxon W   Z-Score   Asymp. Sig. (2-Tailed) (p-Value)
Accuracy for both representations    185          −1.973    0.048
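A rough sketch of this kind of rank-based comparison, applied to the accuracy columns of Tables 13.3 and 13.4 with SciPy's Wilcoxon rank-sum test, is shown below; the exact statistical package and settings used to produce Table 13.6 may apply slightly different conventions.

    # Wilcoxon rank-sum test on the accuracies of the single-instance (Table 13.3) and
    # multi-instance (Table 13.4) algorithms.
    from scipy.stats import ranksums

    single = [0.6690, 0.6667, 0.6476, 0.6881, 0.6857, 0.6952, 0.6810, 0.6476,
              0.5810, 0.6857, 0.6929, 0.6810, 0.6857, 0.6929, 0.6976]
    multi = [0.7357, 0.7262, 0.7167, 0.7071, 0.7024, 0.6810, 0.6786, 0.7071,
             0.7000, 0.6762, 0.6595, 0.6952, 0.6976, 0.6762, 0.6571]

    stat, p_value = ranksums(multi, single)
    print(stat, p_value)   # a p-value below 0.05 indicates a significant difference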
References
1. Nagi, K. and Suesawaluk, P. Research analysis of Moodle reports to gauge the level of inter-
activity in eLearning courses at Assumption University, Thailand. In ICCN’08: Proceedings of
the 18th International Conference on Computer and Communication Engineering, Kuala Lumpur,
Malaysia, 2008.
2. Chou, S. and Liu, S. Learning effectiveness in Web-based technology-mediated virtual learn-
ing environment. In HICSS’05: Proceedings of the 38th Hawaii International Conference on System
Sciences, Washington, DC, January 3–6, 2005.
3. Fausett, L. and Elwasif, W. Predicting performance from test scores using backpropagation and
counterpropagation. In WCCI’94: IEEE World Congress on Computational Intelligence, Orlando,
FL, 1994, pp. 3398–3402.
4. Martínez, D. Predicting student outcomes using discriminant function analysis. In Annual
Meeting of the Research and Planning Group, San Francisco, CA, 2001, pp. 163–173.
5. Minaei-Bidgoli, B. and Punch, W. Using genetic algorithms for data mining optimization in an
educational Web-based system. Genetic and Evolutionary Computation, 2, 2252–2263, 2003.
6. Superby, J.F., Vandamme, J.P., and Meskens, N. Determination of factors influencing the achieve-
ment of the first-year university students using data mining methods. In EDM’06: Workshop on
Educational Data Mining, Hong Kong, China, 2006, pp. 37–44.
7. Kotsiantis, S.B. and Pintelas, P.E. Predicting students marks in Hellenic Open University. In
ICALT’05: The 5th International Conference on Advanced Learning Technologies, Kaohsiung, Taiwan,
July 5–8, 2005, pp. 664–668.
8. Delgado, M., Gibaja, E., Pegalajar, M.C., and Pérez, O. Predicting students’ marks from Moodle
Logs using neural network models. In Current Developments in Technology-Assisted Education,
FORMATEX, Badajoz, Spain, 2006, pp. 586–590.
9. Romero, C. and Ventura, S. Educational data mining: A survey from 1995 to 2005. Expert Systems
with Applications, 33(1), 135–146, 2007.
10. Dietterich, T.G., Lathrop, R.H., and Lozano-Perez, T. Solving the multiple instance problem
with axis-parallel rectangles. Artificial Intelligence, 89(1–2), 31–71, 1997.
11. Weidmann, N., Frank, E., and Pfahringer, B. A two-level learning method for generalized multi-
instance problems. In ECML’03: Proceedings of the 14th European Conference on Machine Learning,
Cavtat-Dubrovnik, Croatia, 2003, pp. 468–479.
12. Scott, S., Zhang, J., and Brown, J. On generalized multiple-instance learning. International
Journal of Computational Intelligence and Applications, 5, 21–35, 2005.
13. Andrews, S., Tsochantaridis, I., and Hofmann, T. Support vector machines for multiple-instance
learning. In NIPS’02: Proceedings of Neural Information Processing System, Vancouver, Canada,
2000, pp. 561–568.
14. Pao, H.T., Chuang, S.C., Xu, Y.Y., and Fu, H. An EM based multiple instance learning method
for image classification. Expert Systems with Applications, 35(3), 1468–1472, 2008.
15. Chen, Y. and Wang, J.Z. Image categorization by learning and reasoning with regions. Journal of
Machine Learning Research, 5, 913–939, 2004.
16. Maron, O. and Lozano-Pérez, T. A framework for multiple-instance learning. In NIPS’97:
Proceedings of Neural Information Processing System 10, Denver, CO, 1997, pp. 570–576.
17. Zhang, Q. and Goldman, S. EM-DD: An improved multiple-instance learning technique. In
NIPS’01: Proceedings of Neural Information Processing System 14, Vancouver, Canada, 2001, pp.
1073–1080.
18. Han, Q.Y. Incorporating multiple SVMs for automatic image annotation. Pattern Recognition,
40(2), 728–741, 2007.
19. Zafra, A., Ventura, S., Romero, C., and Herrera-Viedma, E. Multi-instance genetic program-
ming for web index recommendation. Expert System with Applications, 36, 11470–11479, 2009.
20. Goldman, S.A. and Scott, S.D. Multiple-instance learning of real valued geometric patterns.
Annals of Mathematics and Artificial Intelligence, 39, 259–290, 2001.
21. Ruffo, G. Learning single and multiple instance decision tree for computer security applica-
tions. PhD dissertation, Department of Computer Science, University of Turin, Torino, Italy,
2000.
22. Wang, J. and Zucker, J.-D. Solving the multiple-instance problem: A lazy learning approach. In
ICML’00: Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, 2000,
pp. 1119–1126.
23. Chevaleyre, Y.-Z. and Zucker, J.-D. Solving multiple-instance and multiple-part learning prob-
lems with decision trees and decision rules. Application to the mutagenesis problem. In AI’01:
Proceedings of the 14th of the Canadian Society for Computational Studies of Intelligence, LNCS 2056,
Ottawa, Canada 2001, pp. 204–214.
24. Ramon, J. and De Raedt, L. Multi-instance neural networks. In ICML’00: A Workshop on Attribute-
Value and Relational Learning at the 17th Conference on Machine Learning, San Francisco, CA, 2000.
25. Chai, Y.M. and Yang, Z.-W. A multi-instance learning algorithm based on normalized radial
basis function network. In ISSN’07: Proceedings of the 4th International Symposium on Neural
Networks. LNCS 4491, Nanjing, China, 2007, pp. 1162–1172.
26. Xu, X. and Frank, E. Logistic regression and boosting for labelled bags of instances. In
PAKDD’04: Proceedings of the 8th Conference of Pacific-Asia. LNCS 3056, Sydney, Australia, 2004,
pp. 272–281.
27. Zhou, Z.-H. and Zhang, M.-L. Solving multi-instance problems with classifier ensemble based
on constructive clustering. Knowledge and Information Systems, 11(2), 155–170, 2007.
28. Zafra, A. and Ventura, S. Multi-objective genetic programming for multiple instance learning.
In EMCL’07: Proceedings of the 18th European Conference on Machine Learning, LNAI 4701, Warsaw,
Poland, 2007, pp. 790–797.
29. Rice, W.H. Moodle e-Learning Course Development. A Complete Guide to Successful Learning Using
Moodle. Packt Publishing, Birmingham, U.K., 2006.
30. Wiens, T.S., Dale, B.C., Boyce, M.S., and Kershaw, P.G. Three way k-fold cross-validation of
resource selection functions. Ecological Modelling, 3–4, 244–255, 2007.
31. Witten, I.H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn.
Morgan Kaufmann, San Francisco, CA, 2005.
32. Breiman, L. Random forests. Machine Learning, 45(1), 5–32, 2001.
33. Quinlan, R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA,
1993.
34. Webb, G. Decision tree grafting from the all-tests-but-one partition. In IJCAI’99: Proceedings
of the 16th International Joint Conference on Artificial Intelligence, San Francisco, CA, 1999, pp.
702–707.
35. Martin, B. Instance-based learning: Nearest neighbor with generalization. PhD thesis,
Department of Computer Science. University of Waikato, Hamilton, New Zealand, 1995.
36. Gaines, B.R. and Compton, P. Induction of ripple-down rules applied to modeling large data-
bases. Journal of Intelligence Information System, 5(3), 211–228, 1995.
37. Holte, R.C. Very simple classification rules perform well on most commonly used datasets.
Machine Learning, 11, 63–91, 1993.
38. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., and Murthy, K.R.K. Improvements to Platt’s
SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637–649, 2001.
39. Duda, R. and Hart, P. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
40. George, H.J. and Langley, P. Estimating continuous distributions in Bayesian classifiers. In
UAI’95: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, San Mateo, CA,
1995, pp. 338–345.
41. McCallum, A. and Nigam, K. A comparison of event models for naive Bayes text classification.
In AAAI’98: Workshop on Learning for Text Categorization, Orlando, FL, 1998, pp. 41–48.
42. Tan, K.C., Tay, A., Lee, T., and Heng, C.M., Mining multiple comprehensible classification
rules using genetic programming. In CEC’02: Proceedings of the 2002 Congress on Evolutionary
Computation, Honolulu, HI, 2002, Vol. 2, pp. 1302–1307.
43. Xu, X. Statistical learning in multiple instance problems. PhD thesis, Department of Computer
Science, University of Waikato, Hamilton, New Zealand, 2003.
44. Gärtner, T., Flach, P.A., Kowalczyk, A., and Smola, A.J. Multi-instance kernels. In ICML’02:
Proceedings of the 19th International Conference on Machine Learning, Morgan Kaufmann, Sydney,
Australia, 2002, pp. 179–186.
45. Auer, P. and Ortner, R. A boosting approach to multiple instance learning. In ECML’04:
Proceedings of the 5th European Conference on Machine Learning. Lecture Notes in Computer Science,
Pisa, Italy, 2004, Vol. 3201, pp. 63–74.
46. Frank, E. and Xu, X. Applying propositional learning algorithms to multi-instance data,
Technical report, Department of Computer Science, University of Waikato, 2003.
47. Demsar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research, 17, 1–30, 2006.
14
A Response-Time Model for Bottom-Out Hints as Worked Examples
Contents
14.1 Introduction
14.2 Background
14.3 Data
14.4 Model
14.5 Results
14.6 Conclusions and Future Work
References
14.1 Introduction
Students sometimes use an educational system’s help in unexpected ways. For example,
they may bypass abstract hints in search of a concrete example. This behavior has tra-
ditionally been labeled as gaming or help abuse. In this chapter, we propose that some
examples of this behavior are not abusive and that bottom-out hints can serve as worked
examples. To demonstrate this, we create a model for distinguishing good bottom-out hint
use from bad bottom-out hint use by analyzing logged response times. This model not only predicts learning but also captures behaviors related to self-explanation.
There is a large body of research on measuring students’ affective and metacognitive
states. The goals of this research range from adapting instruction to individual needs to
designing interventions that change affective states. One technique is to build classifica-
tion models for affective states using tutor interaction data, often with student response
times as an independent variable [3,5,6,13]. Several lines of research using this approach
have targeted tutor help abuse as a behavior negatively correlated with learning [3,13],
but these classification rules for help abuse can be quite broad and thus proscribe behav-
iors that may actually benefit learning. For example, students who drill down to the most
detailed hint may not be gaming the system, but instead searching for a worked example.
The goal of this chapter is to show that a simple response-time-based indicator can discern
some good bottom-out hint behaviors. We also provide a case study on the construction of
models for unobserved states or traits.
An indicator for good bottom-out hint behaviors would be useful in several ways. It
could indicate general traits, such as good metacognitive self-regulation, or it might serve
as a proxy for students’ affective states. In general, even an indicator that does not directly
14.2 Background
Usually, a response time is defined as the time required for a subject to respond to a stimulus. For educational systems, this is often calculated at the transaction level: how long does it take a student to perform a task, such as requesting help or entering an answer? The limitations of log data mean that many details are obscure, such as the student's thought process or the presence of distractions. The student's actions may have been triggered by an event outside the system's purview, or the student may be engaging in other off-task behavior. While this is a limitation, it is exactly this off-task aspect of response times that makes them so valuable for estimating students' affective states [6].
There is a large body of prior work on response-time-based models for affective
and metacognitive states. For example, Beck modeled student disengagement using a
response-time-based model [6]. His work is particularly relevant because, like this study,
he used a combination of model building and elimination of irrelevant examples to con-
struct his detector. There are also a number of help-seeking models that utilize response
times in their design to better detect gaming behavior, particularly behaviors that involve
drilling through scaffolding or repeated guessing [3,13]. Still, we know of no examples of
models that can detect help abuse aimed at soliciting a worked example. To the contrary,
one rule shared by many models is that bypassing the traditional scaffold structure to
retrieve the bottom-out hint, which is often the final answer, is considered an undesirable
behavior [3].
In the literature on worked examples, particularly on self-explanation, help abuse is not always bad for learning. If a student truly needs a worked example, drilling down through the help system may be the easiest available means of getting one. There is a significant body of research on worked examples to suggest that this is true [11]. There have also been attempts to apply worked examples to computerized educational systems [4,9]. For this chapter, however, the focus is restricted to considering worked examples in the form of bottom-out hints.
14.3 Data
Our data consists of logs of student interactions with the Geometry Cognitive Tutor [1]. In
the tutor, students are presented with a geometry problem displayed above several empty
text fields. A step in the problem requires filling in a text field. The fields are arranged
systematically on each problem page and might, for example, ask for the values of angles
in a polygon or for the intermediate values required to calculate the circumference of a
circle. In the tutor, a transaction is defined as interacting with the glossary, requesting a
hint, entering an answer, or pressing “done.” The done transaction is only required to start
a new problem, not to start a new step.
The data itself comes from Aleven and Koedinger [1]. They studied the addition of
required explanation steps to the Geometry Cognitive Tutor. In the experimental con-
dition, after entering a correct answer, students were asked to justify their answer by
citing a theorem. This could be done either by searching the glossary or by directly
inputting the theorem into a text field. The tutor labeled the theorem as correct if it was
a perfect, letter-by-letter match for the tutor’s stored answer. This type of reasoning or
justification transaction is the study’s version of self-explanation (albeit with feedback).
In their study, Aleven and Koedinger found that students in the experimental condi-
tion “learned with greater understanding compared to students who did not explain
steps” [1].
The hints are arranged in levels, with each later level providing a more specific sugges-
tion about how to proceed on that step. While the early hint levels can be quite abstract, the
bottom-out hint does everything short of entering the answer. The only required work to
finish a step after receiving a bottom-out hint is to input the answer itself into the text field.
This type of transaction is the study’s version of a simple worked example.
There were 39 students in the study who had both pretest and posttest scores; they were split 20 and 19 between the two conditions. All times were measured in hundredths of a second. The other details of the data are introduced in the discussion of results, Section 14.5.
14.4 Model
The core assumption underlying this work is that bottom-out hints can serve as worked
examples. In our data, a bottom-out hint is defined to be the last hint available for a
given step. The number of hints available differs from step to step, but the frequency of
bottom-out hint requests suggests that students were comfortable drilling down to the
last hint.
Assuming that bottom-out hints sometimes serve as worked examples, there are a cou-
ple of ways to detect desirable bottom-out hint behaviors. The first method is to detect
when a student’s goal is to retrieve a worked example. This method is sensitive to prop-
erly modeling student intentions. The second method, which is the focus of this work, is
to detect when a student is learning from a bottom-out hint, regardless of their intent in
retrieving the hint. This assumes that learning derives from the act of self-explanation or
through a similar mechanism, and can occur even if the student purposefully gamed the
system to retrieve the bottom-out hint. We estimate the amount of learning by using the
time spent thinking about a hint. This makes the approach dependent only on the student's time spent and independent of the student's intention.
To detect learning from bottom-out hints, our model requires estimates of the time students spend thinking about hints. Call the hint time HINT_t, where HINT_t is the time spent thinking about a hint requested on transaction t. Estimating HINT_t is nontrivial. Students may spend some of their time engaged in activities unobserved in the log, like chatting with a neighbor. However, even assuming no external activities or any off-task behavior whatsoever, HINT_t is still not directly observable in the data.
To illustrate this, consider the following model for student cognition. On an answer
transaction, the student first thinks about an answer, then enters an answer, and then
reflects on their solution. Call this model TER for think–enter–reflect. An illustration of
how the TER model would underlie observed data is shown in Figure 14.1. The reflec-
tion time for the second transaction is part of the logged time for the third transaction.
Under the TER model, the reflection time for one transaction is indistinguishable from the
thinking and entry times associated with the next transaction. Nevertheless, we need an
estimate for the think and reflect times to approximate student learning from bottom-out
hints.
The full problem, including external factors, is illustrated in Table 14.1. It shows a series of actual student transactions on a pair of problem steps along with hypothetical, unobserved student cognition. The Step, Transaction, and Observed T entries are taken from the log, the Cognition and Hypothetical T entries are unobserved, and ellipses represent data not relevant to the example. The time the student spends thinking and reflecting on the bottom-out hint is about 6 s, but the available observed durations are 0.347, 15.152, and 4.944 s. In a case like this, the log data's observed response times mix think and reflect times across multiple steps. Unfortunately, while the reflection time is important for properly estimating HINT_t, it is
Figure 14.1
TER model.
Table 14.1
Hypothetical Student Transactions (Observed Data, Unobserved Data)

Step     Transaction     Observed T   Cognition                   Hypothetical T
Step 1   Request hint    0.347        …                           …
                                      Thinking about hint         3
                                      Looking at clock            2
                                      Thinking about date         9
Step 1   Enter answer    15.152       Typing answer               1.152
                                      Thinking about last step    2
                                      Thinking about next step    1
Step 2   Enter answer    4.944        Typing answer               1.944
categorized improperly. The reflection time for transaction t is actually part of the logged time for transaction (t + 1); that is, after receiving a hint, thinking about the hint, and entering the answer, the tutor records a time stamp, but the student may continue thinking about the hint for as long as they wish. Teasing apart these transaction times requires some creative rearranging of terms. Rather than estimating HINT_t directly, we will estimate a number of other values and calculate HINT_t from them.
The first part of our model separates out two types of bottom-out hint cognition: think and reflect. Thinking is defined as all hint cognition before entering the answer; reflecting is all hint cognition after entering the answer. Let think time be denoted by K_t and reflect time by R_t. We define HINT_t = K_t + R_t.
The task reduces to estimating K_t and R_t. As shown earlier, this can be difficult for an arbitrary transaction. However, the task is easier if we restrict our focus to bottom-out hints. Let the time to enter an answer be denoted by E_t. Table 14.2 provides an example of how bottom-out hints differ from other transactions: the absence of a reflect time, R_{t−1}, in the bottom-out case. Except for the time spent on answer entry and the time spent off-task, the time between receiving a hint and entering the answer is K_t. This is because, after receiving a bottom-out hint, a student does not need to engage in reflection about the step before, eliminating the confound R_{t−1}. A similar, but slightly more complicated, result applies to R_t. For now, assume the off-task time is zero; it will be properly addressed later. Let the total time for a transaction be T_t. Then the equation for HINT_t becomes
Table 14.2
TER Model with Estimators

Transaction     Cognition                    Notation
Hint            …                            …
Enter answer    Think about step             K_t
                Off-task
                Enter answer                 E_t
Enter answer    Reflect on previous step     R_t
                Think about new step         K_{t+1}
                Off-task
                Enter answer                 E_{t+1}
    HINT_t = K_t + R_t
           = (T_t − E_t) + R_t
           = (T_t − E_t) + (T_{t+1} − (K_{t+1} + E_{t+1}))
where T_t and T_{t+1} are the observed times in the log data. The first term replaces K_t with measured and unmeasured times from before the answer is submitted (row 2 of Table 14.2). The second term consists of times from after the answer is submitted (row 3 of Table 14.2). We have reduced the problem to something tractable. Now, if we have an estimate for E_t, we can estimate K_t. Similarly, if we have estimates for K_{t+1} and E_{t+1}, we can estimate R_t.
Constructing reliable estimates for the above values is impossible on a per-transaction basis. However, if we aggregate across all transactions performed by a given student, the estimators become feasible. There are two important points regarding the estimators we will use. First, response times, because of their open-ended nature, are extremely prone to outliers. For example, the longest recorded transaction is over 25 min in length. Thus, we require our estimators to be robust. Second, some students have relatively few (≈10) bottom-out hint transactions that will fit our eventual criteria, so our estimators must converge quickly.
Now we require some new notation. We use the subscript s, where s represents a student. We also use the Ê_s notation for estimators and the m(E_t) notation for medians. You can think of the median of E_t, m(E_t), as approximating the mean; we always use the median because of outliers. Ê_s, the per-student estimator, will represent some measure of the usual E_t for a student s. Also, let A be the set of all transactions and A_s be the set of all transactions for a given student. We will further subdivide these sets into correct answer transactions immediately following a bottom-out hint, and transactions following those transactions. These two types of transactions are generalizations of the last two transactions shown in Table 14.2. Let A_s^1 be the set of all correct answer transactions by a student s where the transaction immediately follows a bottom-out hint. These are transactions corresponding to row 2 of Table 14.2. Similarly, let A_s^2 be the set of all transactions that follow a transaction t ∈ A_s^1. These are transactions corresponding to row 3 of Table 14.2. Essentially, a bottom-out hint transaction is followed by a transaction of type A_s^1, which is in turn followed by a transaction of type A_s^2. For convenience, we let T_s^1 = m(T_t), t ∈ A_s^1, be the median time for transactions of the first type and T_s^2 = m(T_t), t ∈ A_s^2, be the median time for transactions of the second type. This gives an equation for our estimator:

    ĤINT_s = (T_s^1 − Ê_s) + (T_s^2 − (K_s^2 + Ê_s))

where K_s^2 = m(K_t), t ∈ A_s^2, is the thinking time during transactions t ∈ A_s^2.
Consider Ês, the median time for student s to enter an answer. Most answers are short
and require very little time to type. If we assume that the variance is small, then Ês ≈ min(Et), t ∈ As; that is, because the variance is small, Et can be treated as a constant. We use the minimum rather than a more common measure, like the mean, because we cannot directly observe Et and must avoid outliers. If Kt ≈ 0, then the total time spent on a post-hint transac-
tion is approximately Et, as shown in Table 14.2. Thus, the minimum time student s spends
on any answer step is a good approximation of Es. In practice, the observed Ês is about 1 s. Using Ês, we can now estimate Kt for t ∈ As1.
To isolate the reflection time, Rs, we need an approximation for Ks2, the thinking time for transactions t ∈ As2. Unfortunately, Ks2 is difficult to estimate. Instead, we will estimate a value related to Ks2. The key observation is that if a student has already thought through an answer on their own, without using any system help, they presumably engage in very little reflection after they enter their solution. Mathematically, let Ns be the set of transactions for a student s where they do not use a bottom-out hint. For cases t ∈ Ns, we can assume that Rt ≈ 0, that is, that students do not reflect. We can now use the following esti-
mator to isolate Rs:
Rs = Ts2 − (Ks2 + Ês)
   ≈ Ts2 − m(Tt), t ∈ Ns

where the approximation in the second line derives from the assumption Rt ≈ 0, t ∈ Ns, so that, for those transactions, the median transaction time approximates the sum Ks2 + Ês. This is the last estimator we require: we can estimate Rs using the median time on the first transaction of a step where the prior step was completed without worked examples, that is, where t ∈ Ns. This approach avoids directly estimating Ks2, and estimates the sum (Ks2 + Ês) instead.
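To make the estimator construction concrete, the following sketch (not taken from the chapter; the field names and the hint_indicators helper are illustrative) computes Ês, Ts1, Ts2, and from them Ks, Rs, and HINTs for a single student from a list of logged transactions:

from statistics import median

def hint_indicators(transactions):
    """Per-student TER estimators (illustrative sketch, not the authors' code).

    `transactions` is a list of dicts for one student, each with hypothetical fields:
      duration -- observed transaction time Tt, in seconds
      in_as1   -- True if this is a correct answer immediately following a bottom-out hint (t in As1)
      in_as2   -- True if this transaction follows a transaction in As1 (t in As2)
      in_ns    -- True if the prior step was completed without a bottom-out hint (t in Ns)
    """
    all_times = [t["duration"] for t in transactions]
    as1_times = [t["duration"] for t in transactions if t["in_as1"]]
    as2_times = [t["duration"] for t in transactions if t["in_as2"]]
    ns_times = [t["duration"] for t in transactions if t["in_ns"]]

    e_s = min(all_times)          # Ês: answer-entry time, estimated by the minimum to avoid outliers
    t_s1 = median(as1_times)      # Ts1: median time for transactions immediately after a bottom-out hint
    t_s2 = median(as2_times)      # Ts2: median time for the transactions that follow those
    k2_plus_e = median(ns_times)  # approximates Ks2 + Ês, since Rt is assumed to be about 0 for t in Ns

    k_s = t_s1 - e_s              # time spent thinking about the bottom-out hint before answering
    r_s = t_s2 - k2_plus_e        # time spent reflecting on the hint after answering
    return {"Ks": k_s, "Rs": r_s, "HINTs": k_s + r_s}

In practice one would also drop end-of-problem transactions and skip students with too few bottom-out hint requests, as described in the text.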
There is still the problem of off-task time. We have so far assumed that the off-task time
is approximately zero. We will continue to hold this assumption. While students sometimes engage in long periods of off-task behavior, we assume that for most transactions, students are on task. This implies that transactions with off-task behaviors are rare, albeit potentially of long
duration. Since we use medians, we eliminate these outliers from consideration entirely, and
thus continue to assume that on any given transaction, off-task time is close to zero.
A subtle point is that the model will not fit well for end-of-problem transactions. At
the end of a problem there is a “done” step, where the student has to decide to hit
“done.” The model no longer accurately represents the student’s cognitive process as
it does not include the “done” decision. These transactions could be valuable to an
extended version of the model, but for this study, all end-of-problem transactions are
simply dropped.
14.5 Results
We first demonstrate the model for students in the control condition. These students were
not required to perform any formal reasoning steps. The goal is to predict the adjusted
pretest to posttest gain, which is the maximum of (post − pre)/(1 − pre) and (post − pre)/pre. We will not use Z-scores because the pretest suffered from a floor effect, so the pretest scores are very non-normal (Shapiro–Wilk: p < 0.005). Two students were removed from the
population for having fewer than five bottom-out hint requests, bringing the population
down to 18. The results are shown in Table 14.3.
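Before turning to the results, here is a minimal sketch of the adjusted-gain computation used above, assuming pre and post are proportion-correct scores in [0, 1]:

def adjusted_gain(pre, post):
    """Adjusted pre-to-post gain: the maximum of (post - pre)/(1 - pre) and (post - pre)/pre.

    pre and post are assumed to be proportion-correct scores in [0, 1];
    the degenerate denominators pre == 0 and pre == 1 are guarded against.
    """
    candidates = []
    if pre < 1:
        candidates.append((post - pre) / (1 - pre))
    if pre > 0:
        candidates.append((post - pre) / pre)
    return max(candidates)

# Example: pre = 0.6, post = 0.8 gives max(0.2/0.4, 0.2/0.6) = 0.5.
print(adjusted_gain(0.6, 0.8))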
Table 14.3
Indicator Correlations in the Control Condition

Indicator    Pre      Post     Adjusted Gain
Ks          −0.11     0.37*    0.34*
Rs           0.29     0.42**   0.36*
HINTs        0.04     0.53**   0.48**

*p < 0.10, **p < 0.05.
The first result of interest is that none of the indicators have statistically significant corre-
lations with the pretest. This suggests that they measure some state or trait of the students
that is not well captured by the pretest. The second result is that all three indicators corre-
late strongly with both the posttest and learning gain. Notably, HINTs, our main indicator,
has a correlation of about 0.5 with both the posttest and the learning gain. To the extent
that HINTs does distinguish between “good” and “bad” bottom-out hint behaviors, this
correlation suggests that the two types of behaviors are useful classifications.
It is possible that these indicators might only be achieving correlations comparable to
time-on-task or average transaction time. As Table 14.4 shows, this is clearly not the case.
All three hint time indicators outperform the traditional time-on-task measures, with the
minimum correlation in Table 14.3 equal to 0.34.
Nevertheless, these results still do not clarify whether the indicator HINTs is actually
measuring what it purports to measure: self-explanation on worked examples. To explore
this question, we use the experimental condition. In the experimental condition, students
were asked to justify their solutions by providing the mathematical theorem associated
with their answer. This changes the basic pattern of transactions from HINT-GUESS-
GUESS to HINT-GUESS-JUSTIFY-GUESS. We can now directly measure Rs using the time
spent on the new JUSTIFY step, since, in the experimental condition, the new justify step
requires a reasoning process. Rs is now the median time students spend on a correct jus-
tification step after a bottom-out hint, subtracting the minimum time they ever spend on
correct justifications. We use the minimum for reasons analogous to those of Ês. In this
condition, there were sufficient observations for all 19 students. The resulting correlations
are shown in Table 14.5.
There is almost no correlation between our indicators and the pretest score, again show-
ing that our indicators are detecting phenomena not effectively measured by the pretest.
The correlations with the posttest and learning gain are also high for both R s and HINTs.
While Rs by itself has a statistically significant correlation at p < 0.10, Ks and Rs combined
demonstrate a statistically significant correlation at p < 0.05. This suggests that while some
students think about a bottom-out hint before entering the answer and some students
think about the hint after entering the answer, all students, regardless of style, benefit from
the time spent thinking about bottom-out hints when such hints are requested. The corol-
lary is that some bottom-out hints are beneficial to learning.
Table 14.4
Time-on-Task Correlations in the Control Condition

Measure                     Pre      Post     Adjusted Gain
Time-on-task               −0.31    −0.10     0.23
Average transaction time   −0.03     0.27     0.20
Table 14.5
Correlations in the Experimental Condition

Indicator    Pre      Post     Adjusted Gain
Ks          −0.02     0.23     0.25
Rs           0.02     0.31*    0.35*
HINTs        0.00     0.38*    0.41**

*p < 0.10, **p < 0.05.
Table 14.6
Changes in Behavior in the Experimental Condition

Indicator    Mean    Var     Pre      Post     Adjusted Gain
ΔHINTs       0.56    5.06    0.41*    0.47**   0.38*

*p < 0.10, **p < 0.05.
So far, we have shown that the indicator HINTs is robust enough to show strong correla-
tions with learning gain despite being measured in two different ways across two separate
conditions. The first set of results demonstrated that HINTs can be measured without any
direct observation of reasoning steps. The second set of results showed that direct obser-
vation of HINTs was similarly effective. Our data, however, allows us access to two other
interesting questions. First, does prompting students to explain their reasoning change
their bottom-out hint behavior? Second, do changes in this behavior correlate with learn-
ing gain?
To answer both questions, we look at the indicators trained on only the first 20% of each
student’s transactions. For this, we use only the experimental condition, since, when 80%
of the data is removed, the control condition has very few remaining students with very
few observations. Even in the experimental condition, only 15 students have more than
five bottom-out hint requests in the reduced data set. The results are shown in Table 14.6,
with ΔHINTs representing the difference between HINTs trained on the first 20% of the
data and HINTs trained on the full data.
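A sketch of how ΔHINTs could be computed, reusing the hypothetical hint_indicators helper sketched earlier and ignoring the guards needed for students with too few observations in the reduced data set; transactions are assumed to be in chronological order:

def delta_hint(transactions, fraction=0.2):
    """ΔHINTs: HINTs estimated from the first `fraction` of a student's transactions
    minus HINTs estimated from the full sequence."""
    cutoff = max(1, int(len(transactions) * fraction))
    early = hint_indicators(transactions[:cutoff])["HINTs"]
    full = hint_indicators(transactions)["HINTs"]
    return early - full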
To answer the first question, the change in HINTs due to the Aleven and Koedinger
intervention is not statistically different from zero. Their intervention, requiring ungraded
self-explanations, does not appear to encourage longer response times in the presence
of bottom-out hints, so this mechanism does not explain their experimental results [1].
However, some students do change behavior. As seen in Table 14.6, students who increased
their HINTs time had higher learning gains.
hints predicts learning. However, extending our results to practical use requires addi-
tional work.
Our indicators provide estimates for student thinking about bottom-out hints. However,
these estimates are aggregated across transactions, providing a student-level indicator.
The indicator could already be used for categorizing students and potentially offering
them individualized help; it could also be an informative component within a higher-
granularity model. However, at present, the indicator does not provide the level of granu-
larity required to choose specific moments for tutor intervention. To achieve that level
of granularity, a better distributional understanding of student response times would be
helpful, as would an indicator capable of distinguishing between students seeking worked
examples versus engaging in gaming. Exploring how the distribution of response times
differs between high-learning bottom-out hint students and low-learning bottom-out hint
students would help solve both problems.
This issue aside, our indicators for student self-explanation time have proven remark-
ably effective. They not only predict learning gain, they do so better than traditional time-
on-task measures, they are uncorrelated with pretest scores, and changes in our indicators
over time also predict learning gain. The indicators achieve these goals without inherent
assumptions about domain or system design, facilitating adaptation to other educational
systems in other domains. However, the actual results presented are derived from only one
well-defined domain and one heavily scaffolded tutoring system. While easily adapted to
other domains and systems, the indicators may or may not be accurate outside of geometry
or in other types of tutoring systems.
The inclusion of two conditions, one without justification steps and one with justification
steps, allowed us to show that the indicators do measure phenomena related to reasoning
or self-explanation. We estimated the indicators for the two conditions in different ways;
yet, both times, the results were significant. This provides a substantial degree of validity.
However, one direction for future work is to show that the indicators correlate with other
measures of self-explanation or worked example cognition. One useful study would be to
compare these indicators with human estimates of student self-explanation.
References
1. Aleven, V. and Koedinger, K. R. An effective meta-cognitive strategy: Learning by doing and
explaining with a computer-based cognitive tutor. Cognitive Science, 26(2), 2002, 147–179.
2. Aleven, V., Koedinger, K. R., and Cross, K. Tutoring answer explanation fosters learning with
understanding. In: Proceedings of the Ninth International Conference on Artificial Intelligence in
Education, LeMans, France, 1999.
3. Aleven, V., McLaren, B. M., Roll, I., and Koedinger, K. R. Toward tutoring help seeking—
Applying cognitive modeling to meta-cognitive skills. In: Proceedings of the Seventh Conference
on Intelligent Tutoring Systems, Alagoas, Brazil, 2004, pp. 227–239.
4. Atkinson, R. K., Renkl, A., and Merrill, M. M. Transitioning from studying examples to solv-
ing problems: Effects of self-explanation prompts and fading worked-out steps. Journal of
Educational Psychology, 95(4), 2003, 774–783.
5. Baker, R. S., Corbett, A. T., and Koedinger, K. R. Detecting student misuse of intelligent tutor-
ing systems. Proceedings of the Seventh International Conference on Intelligent Tutoring Systems,
Alagoas, Brazil, 2004, pp. 531–540.
15
Automatic Recognition of Learner Types in Exploratory Learning Environments
Contents
15.1 Introduction......................................................................................................................... 213
15.2 Related Work....................................................................................................................... 215
15.3 The AIspace CSP Applet Learning Environment.......................................................... 216
15.4 Off-Line Clustering............................................................................................................. 217
15.4.1 Data Collection and Preprocessing...................................................................... 218
15.4.2 Unsupervised Clustering...................................................................................... 219
15.4.3 Cluster Analysis...................................................................................................... 219
15.4.3.1 Cluster Analysis for the CSP Applet (k = 2)........................................... 219
15.4.3.2 Cluster Analysis for the CSP Applet (k = 3)........................................... 221
15.5 Online Recognition.............................................................................................................223
15.5.1 Model Evaluation (k = 2)......................................................................................... 224
15.5.2 Model Evaluation for the CSP Applet (k = 3).......................................................225
15.6 Conclusions and Future Work.......................................................................................... 226
References...................................................................................................................................... 227
15.1 Introduction
Exploratory learning environments (ELEs) provide facilities for student-led exploration
of a target domain with the premise that active discovery of knowledge promotes deeper
understandings than more controlled instruction [32]. Through the use of graphs and
animations, algorithm visualization (AV) systems aim to better demonstrate algorithm
dynamics than traditionally static media, and there has been interest in using them within
ELEs to promote interactive learning of algorithms [15,34]. Despite theories and intuitions
behind AVs and ELEs, reports on their pedagogical effectiveness have been mixed [8,34].
Research has suggested that pedagogical effectiveness is influenced by distinguishing
student characteristics such as metacognitive abilities [8] and learning styles [15,34]. For
example, some students often find such unstructured environments difficult to navigate
effectively and so they may not learn well with them [20]. Such findings highlight the
need for ELEs in general, and specifically for ELEs that use interactive AVs, to provide
adaptive support for students with diverse abilities or learning styles. This is a challeng-
ing goal because of the difficulty in observing distinct student behaviors in such highly
unstructured environments. The few efforts that have been made toward this goal mostly
rely on hand-constructing detailed student models that can monitor student behaviors,
assess individual needs, and inform adaptive help facilities [8,31]. This is a complex and
time-consuming task that typically requires the collaborative efforts of domain, applica-
tion, and model experts.
The authors in [10] explored an approach based on supervised machine learning, where
domain experts manually labeled interaction episodes based on whether or not students
reflected on the outcome of their exploratory actions. The resulting dataset was then used
to train a classifier for student reflection behavior that was integrated with a previously
developed knowledge-based model of student exploratory behavior. While the addition of
the classifier significantly improved model accuracy, this approach suffers from the same
drawbacks of knowledge-based approaches described earlier. It is time consuming and
error prone, because humans have to supply the labels for the dataset, and it needs a priori
definitions of relevant behaviors when there is limited knowledge of what these behaviors
may be.
In this chapter, we explore an alternative approach that addresses the above limitations
by relying on data mining to automatically identify common interaction behaviors, and
then using these behaviors to train a user model. The key distinction between our mod-
eling approach and knowledge-based or supervised approaches with hand-labeled data
is that human intervention is delayed until after a data mining algorithm has automati-
cally identified behavioral patterns. This means, instead of having to observe individual
student behaviors in search of meaningful patterns to model or to input to a supervised
classifier, the developer is automatically presented with a picture of common behavioral
patterns that can then be analyzed in terms of learning effects. Expert effort is potentially
reduced further by using supervised learning to build the user model from the identified
patterns. While these models are generally not as fine-grained as those generated by more
laborious approaches based on experts’ knowledge or labeled data, they may still pro-
vide enough information to inform soft forms of adaptivity in line with the unstructured nature of the interaction with ELEs.
In recent years, there has been a growing interest in exploring the usage of data min-
ing for educational technologies, or educational data mining (EDM). Much of the work on
EDM to date has focused on traditional intelligent tutoring systems (ITSs) that support
structured problem solving [5,33,39] or drill-and-practice activities (e.g., [6]) where stu-
dents receive feedback and hints based on the correctness of their answers. In contrast,
our work aims to model students as they interact with environments that support learning
via exploratory activities, like interactive simulations, where there is no clear notion of cor-
rect or incorrect behavior.
In this chapter, we present the results of applying this approach to an ELE called the
AIspace Constraint Satisfaction Problem (CSP) Applet [1]. We show that by applying unsu-
pervised clustering to log data, we identify interaction patterns that are meaningful to dis-
criminate different types of learners and that would be hard to detect based on intuition
or a basic correlation analysis. We also show preliminary results on using the identified
learner groups for online classification of new users. Our long-term goal is automatic inter-
face adaptations to encourage effective behaviors and prevent detrimental ones.
This chapter is organized as follows. Section 15.2 reports on related work. Section 15.3
describes the CSP Applet. Section 15.4 describes the application of unsupervised learning
for off-line identification of meaningful clusters of users. Section 15.5 illustrates how the
clusters identified in the off-line phase are used directly in a classifier student model. And
finally, in Section 15.6, we conclude with a summary and a discussion of future research
directions.
approach to user modeling differs from these in that we are modeling student interac-
tion behaviors in unstructured environments with no clear definition of correct behavior
instead of static student solutions and errors.
Figure 15.1
AIspace CSP Applet interface. (From Amershi, S. and Conati, C., J. Educ. Data Mining, 1, 1, 2009. With permission.)
limit our analysis to only those relevant to solving a predefined CSP. Here, we provide a
brief description of these functionalities necessary to understand the results of applying
our user modeling framework to this environment.
• Fine Step. Allows the student to manually advance through the AC-3 algorithm at a
fine scale. Fine Step cycles through three stages, triggered by consecutive clicks of
the Fine Step button. First, the CSP Applet selects one of the existing blue (untested)
arcs and highlights it. Second, the arc is tested for consistency. If the arc is consistent,
its color will change to green and the Fine Step cycle terminates. Otherwise, its color
changes to red and a third Fine Step is needed. In this final stage, the CSP Applet
removes the inconsistency by reducing the domain of one of the variables involved in
the constraint, and turns the arc green. Because other arcs connected to the reduced
variable may have become inconsistent as a result of this step, they must be retested
and thus are turned back to blue. The effect of each Fine Step is reinforced explicitly in text through a panel above the graph (see the message above the CSP in Figure 15.1); a code sketch of this cycle appears after this list.
• Step. Executes the AC-3 algorithm in coarser detail. One Step performs all three
stages of Fine Step at once, on a blue arc chosen by the algorithm.
• Direct Arc Click. Allows the student to choose which arc to Step on by clicking
directly on it.
• Domain Split. Allows a student to divide the network into smaller subproblems
by splitting a variable’s domain. This is done by clicking directly on a node in
the network, and then selecting values to keep in the dialog box that appears (see
dialog box at the lower-right corner of the CSP Applet in Figure 15.1). The choice of
variables to split on and values to keep affect the algorithm’s efficiency in finding
a solution.
• Backtrack. Recovers the alternate subproblem set aside by Domain Splitting, allow-
ing for a recursive application of AC-3.
• Auto Arc Consistency (Auto AC). Automatically Fine Steps through the CSP network,
at a user-specified speed, until it is consistent.
• Auto Solve. Iterates between Fine Stepping to reach graph consistency and automati-
cally Domain Splitting until a solution is found.
• Stop. Lets the student stop execution of Auto AC or Auto Solve at any time.
• Reset. Restores the CSP to its initial state so that the student can reexamine the
initial problem and restart the algorithm.
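As an illustration of the three-stage cycle that Fine Step exposes, here is a compact, generic arc-consistency (AC-3) sketch; it is not the Applet's code, and the data structures are invented for the example:

from collections import deque

def ac3(domains, constraints):
    """Generic AC-3 sketch (not the CSP Applet's implementation).

    domains     -- dict mapping each variable to a set of candidate values
    constraints -- dict mapping an arc (x, y) to a predicate over (value_of_x, value_of_y)
    Returns the reduced domains once every arc is consistent ("green").
    """
    queue = deque(constraints)                     # all arcs start out "blue" (untested)
    while queue:
        x, y = queue.popleft()                     # stage 1: select an untested arc
        allowed = constraints[(x, y)]
        pruned = {vx for vx in domains[x]
                  if not any(allowed(vx, vy) for vy in domains[y])}
        if pruned:                                 # stage 2: the arc is inconsistent ("red")
            domains[x] -= pruned                   # stage 3: reduce the domain to restore consistency
            for (a, b) in constraints:             # arcs into the reduced variable may be inconsistent again
                if b == x and (a, b) not in queue:
                    queue.append((a, b))           # turn them back to "blue" for retesting
    return domains

# Tiny example: the constraint X < Y prunes 3 from X's domain and 1 from Y's.
doms = {"X": {1, 2, 3}, "Y": {1, 2, 3}}
cons = {("X", "Y"): lambda vx, vy: vx < vy,
        ("Y", "X"): lambda vy, vx: vx < vy}
print(ac3(doms, cons))   # {'X': {1, 2}, 'Y': {2, 3}}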
The data we use for the research described in this chapter was obtained from a previous
user study investigating user attitudes for the CSP Applet [1]. In Section 15.4, we describe
how we use logged data from this study to identify different groups of learners via unsu-
pervised clustering.
15.4 Off-Line Clustering

Our user modeling framework consists of two phases: off-line identification of clusters of students with distinct interaction behaviors, and online classification of new students based on these clusters. In this section, we focus on the off-line phase. This phase starts with the collec-
tion and preprocessing of raw, unlabeled data from student interaction with the target
environment. The result of preprocessing is a set of feature vectors representing indi-
vidual students in terms of their interaction behavior. These vectors are then used as input
to a clustering algorithm that groups them according to their similarity. The resulting
groups, or “clusters,” represent students who interact similarly with the environment.
These clusters are then analyzed by the model developer in order to determine whether
and how they represent interaction behaviors that are effective or ineffective for learning.
In Sections 15.4.1 through 15.4.3, we detail the various steps involved in the off-line phase
in the context of student interaction with the CSP Applet.
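A minimal sketch of this off-line phase, assuming each student has already been reduced to a fixed-length feature vector (here, hypothetical action frequencies and latency statistics) and using scikit-learn's k-means as a stand-in for the clustering algorithm:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_students(feature_vectors, k=2, seed=0):
    """Group students by interaction behavior (illustrative sketch).

    feature_vectors -- array of shape (n_students, n_features), e.g., per-action
                       frequencies plus latency means and standard deviations
    Returns the cluster label for each student and the fitted k-means model.
    """
    X = StandardScaler().fit_transform(feature_vectors)   # put features on a comparable scale
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return model.labels_, model

# Toy example: six students described by (Fine Step frequency, Fine Step latency average).
students = np.array([[0.12, 3.0], [0.11, 3.5], [0.13, 2.8],
                     [0.02, 10.0], [0.03, 9.5], [0.02, 11.0]])
labels, _ = cluster_students(students, k=2)
print(labels)   # two behaviorally distinct groups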
and practically) significantly (p < .05 and d > .8, respectively) higher learning gains (7 points) than the other cluster (20 students, 3.08 points gain). Hereafter, we will refer to these clusters as the ‘HL’ (high learning) cluster and the ‘LL’ (low learning) cluster, respectively.
In order to characterize the HL and LL clusters in terms of distinguishing student inter-
action behaviors, we did a pair-wise analysis of the differences between the clusters along
each of the 21 dimensions. Table 15.1 summarizes the results of this analysis, where the
features reported are those for which we found statistically or practically significant dif-
ferences. Here, we interpret the differences along these individual feature dimensions, or
discuss combinations of dimensions that yielded sensible results. The results on the use
of the Fine Step feature are quite intuitive. From Table 15.1, we can see that the LL students
used this feature significantly more frequently than the HL students. In addition, both the
latency averages and standard deviations after a Fine Step were significantly shorter for the LL students, indicating that they consistently Fine Stepped too quickly.
These results plausibly indicate that LL students may be using this feature mechanically,
without pausing long enough to consider the effects of each Fine Step, a behavior that may
contribute to the LL gains achieved by these students.
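The kind of pair-wise comparison summarized in Table 15.1 can be reproduced along these lines (a Welch t-test and a pooled-standard-deviation Cohen's d are assumptions; the chapter does not name the exact test used):

import numpy as np
from scipy import stats

def compare_feature(cluster_a, cluster_b):
    """Compare one feature dimension between two clusters of students.

    Returns the two-sided p-value of a Welch t-test and Cohen's d based on the
    pooled standard deviation (both choices are assumptions for this sketch).
    """
    a, b = np.asarray(cluster_a, float), np.asarray(cluster_b, float)
    _, p = stats.ttest_ind(a, b, equal_var=False)
    n_a, n_b = len(a), len(b)
    pooled_sd = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2))
    d = abs(a.mean() - b.mean()) / pooled_sd
    return p, d

# Example: hypothetical Fine Step latency averages for the HL and LL clusters.
p, d = compare_feature([10.2, 11.5, 9.8, 12.0], [3.1, 2.9, 3.4, 3.0, 2.7])
print(f"p = {p:.3f}, d = {d:.2f}")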
The HL students used the Auto AC feature more frequently than the LL students, although
the difference is not statistically significant. In isolation, this result appears unintuitive
considering that simply watching the AC-3 algorithm in execution is an inactive form of
learner engagement [24]. However, in combination with the significantly higher frequency
of Stopping (see “Stop frequency” in Table 15.1), this behavior suggests that the HL students
could be using these features to forward through the AC-3 algorithm in larger steps to
analyze it at a coarser scale, rather than just passively watching the algorithm progress.
The HL students also paused longer and more selectively after Resetting than the LL
students (see “Reset latency average” and “Reset latency SD” entries in Table 15.1). With the
hindsight that these students were successful learners, we can interpret this behavior as an
indication that they were reflecting on each problem more than the LL students. However,
without the prescience of learning outcomes, it is likely that an application expert or edu-
cator observing the students would overlook this less obvious behavior.
There was also a significant difference in the frequency of Domain Splitting between the HL and LL clusters of students, with the LL cluster frequency being higher (see "Domain Split frequency" in Table 15.1). As it is, it is hard to find an intuitive explanation for this result in terms of learning. However, the analysis of the clusters found with k = 3 in Section 15.4.3.2 shows finer distinctions along this dimension, as well as along the latency dimensions after a Domain Split action. These latter findings are more revealing, indicating that there are likely more than two learning patterns.

Table 15.1
Pair-Wise Feature Comparisons between HL and LL Clusters for k = 2

Feature Description         HL Average   LL Average   p       Cohen's d
Fine Step frequency         .025         .118         6e-4*   1.34*
Fine Step latency average   10.2         3.08         .013*   1.90*
Fine Step latency SD        12.2         4.06         .005*   2.04*
Stop frequency              .003         7e-4         .058    .935*
Stop latency SD             1.06         0            .051    1.16*
Reset latency average       46.6         11.4         .086    .866*
Reset latency SD            24.4         9.56         .003*   1.51*
Domain Split frequency      .003         .009         .012*   .783

Source: Amershi, S. and Conati, C., J. Educ. Data Mining, 1, 1, 2009. With permission.
* Significant at p < .05 or d > .8.
Table 15.2
Three-Way Comparisons between HL, LL1, and LL2 Clusters

Feature Description            HL Average   LL1 Average   LL2 Average   F      p       Partial η²
Fine Step frequency            .025         .111          .122          1.98   .162    .159*
Fine Step latency average      10.2         3.07          3.08          20.4   1e-5*   .660*
Fine Step latency SD           12.2         4.82          3.55          12.1   3e-4*   .536*
Auto AC frequency              .007         .003          .004          2.66   .093    .202*
Stop frequency                 .003         3e-4          9e-4          3.00   .071    .222*
Stop latency SD                1.06         0             0             15.8   6e-4*   .600*
Reset latency average          46.6         18.7          6.52          6.94   .005*   .398*
Reset latency SD               24.4         14.2          6.43          5.09   .016*   .327*
Domain Split frequency         .003         .018          .003          12.0   3e-4*   .532*
Domain Split latency average   6.75         8.68          1.89          12.0   3e-4*   .533*
Domain Split latency SD        1.37         6.66          .622          27.7   1e-6*   .725*
Backtrack latency average      1.75         8.90          .202          3.21   .061    .234*
Backtrack latency SD           0            7.96          .138          2.92   .076    .218*

Source: Amershi, S. and Conati, C., J. Educ. Data Mining, 1, 1, 2009. With permission.
* Significant at p < .05 or partial η² > .14.
Table 15.3
Post Hoc Pair-Wise Comparisons between HL, LL1, and LL2 Clusters

                               HL versus LL1      HL versus LL2      LL1 versus LL2
Feature Description            pHSD     d         pHSD     d         pHSD     d
Fine Step frequency            .142     1.10*     .078     1.48*     .691     .106
Fine Step latency average      1e-5*    1.98*     1e-5*    1.85*     .818     .007
Fine Step latency SD           .001*    1.68*     1e-4*    2.33*     .395     .356
Auto AC frequency              .046*    .745      .076     .666      .595     .228
Stop frequency                 .031*    1.21*     .081     .783      .449     .296
Stop latency average           .552     .966*     .692     .169      .287     .387
Stop latency SD                5e-5*    1.16*     3e-5*    1.16*     .823     0
Reset latency average          .031*    .673      .002*    1.01*     .194     .867
Reset latency SD               .136     1.08*     .007*    1.84*     .125     .601
Domain Split frequency         .003*    1.91*     .820     .011      2e-4*    1.69*
Domain Split latency average   .350     .483      .019*    1.24*     1e-4*    1.79*
Domain Split latency SD        1e-4*    2.83*     .488     .527      0*       2.73*
Backtrack latency average      .167     .745      .667     .648      .028*    .934*
Backtrack latency SD           .118     .867*     .811     .556      .042*    .851*

Source: Amershi, S. and Conati, C., J. Educ. Data Mining, 1, 1, 2009. With permission.
* Significant at pHSD < .05 or d > .8.
between the HL and LL1 clusters), suggesting that the HL students were using these fea-
tures to selectively forward through the AC-3 algorithm to learn. The HL students also
paused longer and more selectively after Resetting than both the LL1 and LL2 students,
suggesting that the HL students may be reflecting more on each problem.
The k = 3 clustering also reveals several additional patterns, not only between the HL and LL clusters, but also between the two LL clusters, indicating that k = 3 was better at discriminating relevant student behaviors. For example, the k = 2 clusters showed that the LL students used the Domain Split feature more frequently than the HL students; however, the k = 3 clustering reveals a more complex pattern. This pattern is summarized by the follow-
ing combination of findings:
• LL1 students used the Domain Split feature significantly more than the HL students.
• HL and LL2 students used the Domain Split feature comparably frequently.
• HL and LL1 students had similar pausing averages after a domain split, and
paused significantly longer than the LL2 students.
• LL1 students paused significantly more selectively (had a higher standard devia-
tion for pause latency) than both HL and LL2 students.
• LL1 had longer pauses after Backtracking than both HL and LL2 clusters.
indicate that long pauses for LL1 students indicated confusion about these Applet fea-
tures or the concepts of Domain Splitting and backtracking, rather than effective reflec-
tion. This is indeed a complex behavior that may have been difficult to identify through
mere observation.
Figure 15.2
Performance of CSP Applet user models (k = 2) over time, plotting the percentage of correct classifications (overall, HL cluster, LL cluster, and baseline) against the percentage of actions seen over time. (From Amershi, S. and Conati, C., J. Educ. Data Mining, 1, 1, 2009. With permission.)
HL students is better than the baseline, it may still cause a system based on this model to
interfere with an HL student’s natural learning behavior, thus hindering student control,
one of the key aspects of ELEs. The imbalance between accuracy in classifying LL and
HL students is likely due to the distribution of the sample data [38] as the HL cluster has
fewer data points than the LL cluster (4 compared to 20). This is a common phenomenon
observed in classifier learning. Collecting more training data to correct for this imbalance,
even if the cluster sizes are representative of the natural population distributions, may
help to increase the classifier user model’s accuracy on HL students [38].
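One plausible realization of the online classifier user model evaluated here (a sketch under assumptions, not the authors' code) keeps a running feature vector for a new student and, after each observed action, assigns the student to the nearest cluster centroid from the off-line phase; the featurize function is hypothetical and is assumed to produce features on the same scale used off-line:

import numpy as np

def classify_online(action_stream, centroids, featurize):
    """Yield a predicted cluster label after each incoming action.

    action_stream -- iterable of logged interface actions for one student
    centroids     -- array (k, n_features) of cluster centers from the off-line phase
    featurize     -- hypothetical function mapping the actions seen so far to a feature vector
    """
    seen = []
    for action in action_stream:
        seen.append(action)
        x = np.asarray(featurize(seen), dtype=float)
        distances = np.linalg.norm(centroids - x, axis=1)   # distance to each cluster center
        yield int(np.argmin(distances))                     # nearest-centroid classification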
Figure 15.3
Performance of CSP Applet user models (k = 3) over time, plotting the overall percentage of correct classifications and the baseline against the percentage of actions seen over time. (From Amershi, S. and Conati, C., J. Educ. Data Mining, 1, 1, 2009. With permission.)
Figure 15.4
Performance of CSP Applet user models (k = 3) over time for individual clusters (HL, LL1, and LL2 groups), plotting the percentage of correct classifications against the percentage of actions seen over time. (From Amershi, S. and Conati, C., J. Educ. Data Mining, 1, 1, 2009. With permission.)
the actions. The accuracy of the model at classifying LL1 students (dotted line) also begins
low, but then reaches approximately 75% after seeing about 60% of the actions, and con-
verges to approximately 85%. The accuracy for the LL2 students (solid line) remains rela-
tively consistent as actions are observed, eventually reaching approximately 75%. As with the increase in the SC, the lower accuracy, sensitivity, and specificity of this classifier user model are likely an artifact of the fewer data points within each cluster. Further supporting
this hypothesis is the fact that the LL2 cluster, which had 12 members, had the highest
classification accuracy (80.3% averaged over time), whereas the HL and LL1 clusters, which
had only four and eight members, respectively, had visibly lower classification accuracies
(66.3% and 44.9% averaged over time, respectively). Therefore, more training data should
be collected and used, particularly as the number of clusters increases, when applying our
user modeling framework.
15.6 Conclusions and Future Work

We applied our approach to build the user model for the CSP Applet, an exploratory
environment that uses interactive visualizations to help students understand an algorithm
for constraint satisfaction. We presented results showing that, despite limitations due to
the availability of data, our approach is capable of detecting meaningful clusters of student
behaviors, and can achieve reasonable accuracy for the online categorization of new stu-
dents in terms of the effectiveness of their learning behaviors.
The next steps of this research include testing our proposed approach with larger datas-
ets, and experimenting with other clustering algorithms, in particular, a probabilistic vari-
ant of k-means called expectation maximization (EM) [11].
We will then investigate how to use the results of the online student modeling phase
to provide adaptive support during interaction with AIspace. We are planning to experi-
ment with a multilayered interface design [29], where each layer’s mechanisms and help
resources are tailored to facilitate learning for a given learner group identified by clus-
tering. Then, based on a new learner’s classification, the environment could select the
most appropriate interface layer for that learner. For instance, the AIspace CSP Applet
may select a layer with Fine Step disabled or with a subsequent delay to encourage care-
ful thought for those students classified as ineffective learners by the two-class classifier
user model described in Section 15.5.1. Similarly, for the three-class case, the CSP Applet
could disable or introduce a delay after Fine Step for students classified into either of the
ineffective learner groups. Additionally, in this case, the CSP Applet could also include
a delay after Domain Splitting for students classified into the LL2 (low learning 2) group,
as these students were consistently hasty in using this feature (see Section 15.4.3.2). The
other ineffective learner group, LL1 (low learning 1), discovered by our framework in
this experiment was characterized by lengthy pauses after Domain Splitting as well as
Backtracking, indicating confusion about these CSP Applet mechanisms or concepts (see Section 15.4.3.2). Therefore, general tips about Domain Splitting and Backtracking could be
made more prominent for these particular students for clarification purposes.
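To illustrate how such classifications could drive the soft adaptations described above, a hypothetical policy table might look as follows (the group names follow the chapter; the concrete actions are invented):

# Hypothetical mapping from classified learner group to interface adaptations.
ADAPTATIONS = {
    "HL":  [],                                                 # effective learners: no intervention
    "LL1": ["delay:fine_step",
            "show_tips:domain_split", "show_tips:backtrack"],  # long, possibly confused pauses
    "LL2": ["delay:fine_step", "delay:domain_split"],          # hasty use of these features
}

def adaptations_for(group):
    """Return the (illustrative) adaptations for a classified learner group."""
    return ADAPTATIONS.get(group, [])

print(adaptations_for("LL2"))   # ['delay:fine_step', 'delay:domain_split']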
Finally, we want to test the generality of our approach by applying it to other learning
environments. We have already obtained positive results similar to those described in the
paper with an ELE for mathematical functions [2]. We now want to experiment with apply-
ing the approach to educational computer games, that is, educational environments in which the exploratory component is integrated into gamelike activities.
References
1. Amershi, S., Carenini, G., Conati, C., Mackworth, A., and Poole, D. 2008. Pedagogy and usability
in interactive visualizations—Designing and evaluating CIspace. Interacting with Computers—
The Interdisciplinary Journal of Human-Computer Interaction 20 (1), 64–96.
2. Amershi, S. and Conati, C. 2007. Unsupervised and supervised machine learning in user mod-
eling for intelligent learning environments. Proceedings of Intelligent User Interfaces, Honolulu,
HI, pp. 72–81.
3. Amershi, S. and Conati, C. 2009. Combining unsupervised and supervised machine learning
to build user models for exploratory learning environments. The Journal of Educational Data
Mining, 1, 1.
4. Ayers, E., Nugent, R., and Dean, N. 2008. Skill set profile clustering based on weighted student
responses. Proceedings of the 1st International Conference on Educational Data Mining, Montreal,
Quebec, Canada, pp. 210–217.
5. Baker, R. S. J. D., Corbett, A. T., Roll, I., and Koedinger, K. R. 2008. Developing a generalizable
detector of when students game the system. User Modeling and User-Adapted Interaction 18 (3),
287–314.
6. Beck, J. 2005. Engagement tracing: Using response times to model student disengagement.
Proceedings of the International Conference on Artificial Intelligence in Education, Amsterdam,
the Netherlands.
7. Beck, J. and Woolf, B. P. 2000. High-level student modeling with machine learning. Proceedings
of Intelligent Tutoring Systems, Montreal, Quebec, Canada.
8. Bunt, A. and Conati, C. 2003. Probabilistic student modeling to improve exploratory behavior.
UMUAI 13 (3), 269–309.
9. Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Hillsdale, NJ:
Lawrence Erlbaum Associates.
10. Conati, C. and Merten, C. 2007. Eye-tracking for user modeling in exploratory learning envi-
ronments: An empirical evaluation. Knowledge Based Systems 20(6), 557–574.
11. Duda, R. O., Hart, P. E., and Stork, D. G. 2001. Pattern Classification, 2nd edn. New York:
Wiley-Interscience.
12. Faraway, J. J. 2002. Practical Regression and ANOVA Using R. http://www.maths.bath.
ac.uk/~jjf23/book/pra.pdf
13. Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics
7 (2), 179–188.
14. Gorniak, P. J. and Poole, D. 2000. Building a stochastic dynamic model of application use.
Proceedings of UAI, San Francisco, CA.
15. Hundhausen, C. D., Douglas, S. A., and Stasko, J. T. 2002. A meta-study of algorithm visualiza-
tion effectiveness. Visual Languages and Computing 13 (3), 259–290.
16. Hunt, E. and Madhyastha, T. 2005. Data mining patterns of thought. Proceedings of the AAAI
Workshop on Educational Data Mining, Pittsburgh, PA.
17. Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Computing
Surveys 31 (3), 264–323.
18. Johns, J. and Woolf, B. 2006. A dynamic mixture model to detect student motivation and proficiency. Proceedings of the 21st National Conference on Artificial Intelligence, Boston, MA,
pp. 163–168.
19. Kearns, M. and Ron, D. 1997. Algorithmic stability and sanity-check bounds for leave-one-out
cross-validation. Proceedings of Computational Learning Theory, Nashville, TN.
20. Kirschner, P., Sweller, J., and Clark, R. 2006. Why minimal guidance during instruction does not
work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and
inquiry-based teaching. Educational Psychologist 41 (2), 75–86.
21. Lange, T., Braun, M. L., Roth, V., and Buhmann, J. M. 2003. Stability-based model selection.
Proceedings of NIPS, Vancouver, Whistler, Canada.
22. Mayo, M. and Mitrovic, A. 2001. Optimising ITS behavior with Bayesian networks and decision
theory. Artificial Intelligence in Education 12, 124–153.
23. Mobasher, B. and Tuzhilin A. 2009. Special issue on data mining for personalization, Journal of
User Modeling and User-Adapted Interaction, 19 (1–2).
24. Naps, T. L., Rodger, S., Velázquez-Iturbide, J., Rößling, G., Almstrum, V., Dann, W. et al. 2003.
Exploring the role of visualization and engagement in computer science education. ACM
SIGCSE Bulletin 35 (2), 131–152.
25. Perera, D., Kay, J., Yacef, K., Koprinska, I., and Zaiane, O. 2009. Clustering and sequential pattern mining of online collaborative learning data. IEEE Transactions on Knowledge and Data Engineering, 21(6), 759–772.
26. Poole, D., Mackworth, A., and Goebel, R. 1998. Computational Intelligence: A Logical Approach.
New York: Oxford University Press.
27. Romero, C., Ventura, S., Espejo, P. G., and Hervas, C. 2008. Data mining algorithms to classify
students. Proceedings of Educational Data Mining, Montreal, Quebec, Canada, pp. 8–17.
28. Rodrigo, M. M. T., Anglo, E. A., Sugay, J. O., and Baker, R. S. J. D. 2008. Use of unsupervised
clustering to characterize learner behaviors and affective states while using an intelligent tutor-
ing system. Proceedings of International Conference on Computers in Education, Taipei, Taiwan.
29. Shneiderman, B. 2003. Promoting universal usability with multi-layer interface design.
Proceedings of the ACM Conference on Universal Usability, Vancouver, British Columbia, Canada.
30. Shih, B., Koedinger, K., and Scheines, R. 2008. A response time model for bottom-out hints as
worked examples. Proceedings of the 1st International Conference on Educational Data Mining,
Montreal, Quebec, Canada.
31. Shute, V. 1994. Discovery learning environments: Appropriate for all? Proceedings of the American
Educational Research Association, New Orleans, LA.
32. Shute, V. and Glaser, R. 1990. A large-scale evaluation of an intelligent discovery world.
Interactive Learning Environments 1, 51–76.
33. Sison, R., Numao, M., and Shimura, M. 2000. Multistrategy discovery and detection of novice
programmer errors. Machine Learning 38, 157–180.
34. Stern, L., Markham, S., and Hanewald, R. 2005. You can lead a horse to water: How students
really use pedagogical software. Proceedings of the ACM SIGCSE Conference on Innovation and
Technology in Computer Science Education, St-Louis, MO.
35. Suarez, M. and Sison, R. 2008. Automatic construction of a bug library for object oriented novice
java programming errors. Proceedings of Intelligent Tutoring Systems, Montreal, Quebec, Canada.
36. Talavera, L. and Gaudioso, E. 2004. Mining student data to characterize similar behavior groups
in unstructured collaboration spaces. Proceedings of the European Conference on AI Workshop on AI
in CSCL, Valencia, Spain.
37. Walonoski, J. A. and Heffernan, N. T. 2006. Detection and analysis of off-task gaming behav-
ior in intelligent tutoring systems. Proceedings of the 8th International Conference on Intelligent
Tutoring Systems, Jhongli, Taiwan.
38. Weiss, G. M. and Provost, F. 2001. The effect of class distribution on classifier learning: An
empirical study. (Technical No. ML-TR-44). Rutgers University, New Brunswick, NJ.
39. Zaiane, O. 2002. Building a recommender agent for e-learning systems. Proceedings of the
International Conference on Computers in Education, Auckland, New Zealand.
16
Modeling Affect by Mining Students’
Interactions within Learning Environments
Contents
16.1 Introduction......................................................................................................................... 231
16.2 Background.......................................................................................................................... 232
16.3 Methodological Considerations........................................................................................ 233
16.4 Case Studies.........................................................................................................................234
16.4.1 Case Study 1: Detecting Affect from Dialogues with AutoTutor.....................234
16.4.1.1 Context.......................................................................................................234
16.4.1.2 Mining Dialogue Features from AutoTutor’s Log Files......................234
16.4.1.3 Automated Dialogue-Based Affect Classifiers.................................... 235
16.4.2 Case Study 2: Predictive Modeling of Student-Reported Affect from
Web-Based Interactions in WaLLiS...................................................................... 235
16.4.2.1 Context....................................................................................................... 235
16.4.2.2 Machine Learned Models from Student–System Interactions.......... 236
16.5 Discussion............................................................................................................................ 238
16.6 Conclusions.......................................................................................................................... 240
Acknowledgments....................................................................................................................... 241
References...................................................................................................................................... 241
16.1 Introduction
In the past decade, research on affect-sensitive learning environments has emerged as an
important area in artificial intelligence in education (AIEd) and intelligent tutoring systems
(ITS) [1–6]. These systems aspire to enhance the effectiveness of computer-mediated tuto-
rial interactions by dynamically adapting to individual learners’ affective and cognitive
states [7] thereby emulating accomplished human tutors [7,8]. Such dynamic adaptation
requires the implementation of an affective loop [9], consisting of (1) detection of the learner's affective states, (2) selection of system actions that are sensitive to a learner's affective and cognitive states, and sometimes (3) synthesis of emotional expressions by animated pedagogical agents that simulate human tutors or peer learning companions [9,10].
The design of affect-sensitive learning environments is grounded in research that states
that the complex interplay between affect and cognition during learning activities is of
crucial importance to facilitating learning of complex topics, particularly at deeper levels
of comprehension [11–17]. Within this research, one particular area of interest is concerned
with the ways in which human tutors detect and respond to learners' affective states: robust detection of learners' affect is critical to enabling the development of affect-sensitive ITSs.
In this chapter, we will examine the state-of-the art methods by which such detection can
be facilitated. In particular, we examine the issues that arise from the use of supervised
machine-learning techniques as a method for inferring learners’ affective states based on
features extracted from students’ naturalistic interactions with computerized learning
environments. Our focus is on general methodological questions related to educational
data mining (EDM) with an emphasis on data collection protocols and machine-learning
techniques for modeling learners’ affective states. We present two case studies that dem-
onstrate how particular EDM techniques are used to detect learners’ affective states based
on parameters that are collected during learners’ interactions with different learning envi-
ronments. We conclude with a critical analysis of the specific research outcomes afforded
by the methods and techniques employed in the case studies that we present.
16.2 Background
There is an increasing body of research that is concerned with identifying the affective
states that accompany learning and devising ways to automatically detect them during
real interactions within different educational systems [18–24].
Different approaches to detecting affect focus on monitoring facial expressions, acoustic–prosodic features of speech, gross body language, and physiological measures such as skin conductivity or heart-rate monitoring. For extensive reviews, the reader is referred to [25–31].
Another approach involves an analysis of a combination of lexical and discourse features
with acoustic–prosodic and lexical features obtained through a learner’s interaction with
spoken dialogue systems. A number of research groups have reported that appending
an acoustic–prosodic and lexical feature vector with dialogue features results in a 1%–4%
improvement in classification accuracy [32–34]. While these approaches have been shown
to be relatively successful in detecting affective states of the learners, some tend to be quite
expensive and some of them can be intrusive and may interfere with the learning process.
An interesting alternative to physiological and bodily measures for affect detection is to
focus on learners’ actions that are observable in the heat of the moment, i.e., at the time at
which they are produced by the learner in the specific learning environment [6]. One obvi-
ous advantage of referring to these actions is that the technology required is less expensive
and less intrusive than physiological sensors. Furthermore, by recording learner’s actions
as they occur, it is possible to avoid imposing additional constraints on the learner (i.e.,
no cameras, gloves, head gear, etc.), thereby also reducing the risk of interference with the
actual learning process, and it somewhat alleviates the concern that learners might dis-
guise the expression of certain negative emotions.
In this chapter, we present two case studies that differ from the previous research in that we focus on interaction logs as the primary means to infer learner affect. The first study considers a broad set of features, including lexical, semantic, and contextual cues, as the
basis for detecting learners’ affective states. The second case study employs an approach
that relies only on student actions on the interactive features of the environment, such as hint or information buttons, to infer learner affect.
Section 16.3 discusses general methodological considerations behind these two
approaches before presenting the two case studies in more detail.
The methods listed above have possible advantages and disadvantages. The reader is
referred to [46,48], where these are discussed in detail. The rest of this chapter presents
two case studies that demonstrate the use of the data-collection methodology and data-
mining techniques.
were ordered onto a scale of conversational directness, ranging from −1 to 1, in terms of the
amount of information the tutor explicitly provides the student. AutoTutor’s short feedback
(negative, neutral negative, neutral, neutral positive, positive) is manifested in its verbal
content, intonation, and a host of other nonverbal cues. The feedback was aligned on a 5-point scale ranging from −1 (negative) to 1 (positive feedback).
was obtained out of 209 students who were already familiar with WaLLiS. These students
were using the system at their own time and location, while they were studying for a real
course of their immediate interest. The sample was selected on the basis of disproportion-
ate stratified random sampling [54] and included students with different mathematical
abilities and awareness of their own abilities.
Similar to the previous case study, data was collected from two studies, one with students
retrospectively reporting on their own affect and one with tutors watching replays of stu-
dents’ interactions. Only the first study collected substantial amount of data to conduct a
machine-learning analysis. The second met significant difficulties, because, in the absence
of other information (e.g., participants’ face) tutors found it very difficult to provide judg-
ments of students’ affect. Hence, this study was only used for qualitative analysis and to
validate the models derived from the machine-learning procedures as described below.
Figure 16.1
Graphical representation of the decision tree for confidence. Each node represents an attribute, and the labels on the edges between nodes indicate the possible values of the parent attribute. Following the path from a root to a leaf results in a rule that shows the value of the factor over the values of the attributes in the path. The numbers in brackets next to the designated class indicate the number of instances correctly classified by this rule over the misclassified ones. Nodes with layers (e.g., node a) originate from a history vector. For example, AnswerIncorr > 2 shows that in the relevant time window, more than two incorrect answers were provided by the student.
Figure 16.1 shows a graphic representation of the tree resulting from the merged sets of instances
with attributes action, difficulty, student ability, last answer, last feedback, time, and the history
vector. As an example, the rules associated with the confirm-answer action of students
are described. The tree suggests that when students confirm an incorrect answer (node a)
after having requested at least one hint previously, their confidence decreases (leaf b). If no
hints were requested, leaf (c) suggests that if they previously had many (more than two)
incorrect answers, then they report that their confidence decreases (the two misclassified
instances in leaf (c) are of an extreme decrease). Otherwise, it seems that the outcome
depends on students’ previous knowledge, the difficulty of the question, and the necessity
of the help request (node d).
The rest of the tree can be interpreted in a similar fashion. For example, the rule associ-
ated with node (e), where students are requesting hints after submitting a wrong answer
shows that students’ reports vary depending on whether or not their previous answer
was partially correct. A closer inspection of the rule suggests that in situations where the
students provided a partially correct answer, and where the system responds to such an
answer with negative feedback, the students’ confidence level tends to drop. This is par-
ticularly the case for students who do not spend sufficient time to read and interpret the
hints provided by the system (node f).
Overall, with the addition of history in the vector, cross-validation of the decision tree for confidence indicated that the tree correctly classified 90.91% of the cases (Kappa = 0.87). Accuracy for effort was 89.16%, with Kappa = 0.79; these can be considered satisfactory results. Although, in this case, the addition of history did not improve the
results significantly, such information can be very useful in situations like open-learner
modeling where it would be important to communicate to the student the rationale behind
the system’s decisions.
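A sketch of the kind of rule induction and cross-validation reported above, assuming each instance encodes the current action, task difficulty, student ability, last answer, last feedback, elapsed time, and history-vector counts; scikit-learn's decision tree and kappa implementation stand in for the tools actually used, which the chapter does not name:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

def induce_confidence_rules(X, y, seed=0):
    """Train a decision tree to predict reported confidence change
    (e.g., increase, decrease, no report) from interaction features.

    X -- array (n_instances, n_features) of encoded interaction features (illustrative encoding)
    y -- array of self-reported labels
    Returns cross-validated accuracy and Cohen's kappa.
    """
    tree = DecisionTreeClassifier(random_state=seed)
    predictions = cross_val_predict(tree, X, y, cv=10)   # 10-fold cross-validation (an assumption)
    return accuracy_score(y, predictions), cohen_kappa_score(y, predictions)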
The results demonstrate that rule induction provides a mechanism for deriving a pre-
dictive model in the form of rules that is based on students’ actions with the system. Most
of these rules are intuitive but defining them by hand would have required a thorough,
operational understanding of the processes involved, not easily achieved by experts in
the field. Although the process for collecting data was time consuming and led to a small number of unequivocal rules, the methodology and machine-learning method are generalizable to different situations, resulting at least in hypotheses about rules that can guide the design of future studies.
16.5 Discussion
The case studies presented in this chapter demonstrate the benefits of using EDM tech-
niques to monitor complex mental states in educational settings. They also identify several
important challenges for the EDM field, particularly in relation to the prediction of learner
affect. We highlighted different methods for collecting and annotating data of students’
interaction, the importance of ecological validity, as well as the difficulty of achieving it.
Several chapters in this book provide examples of the types of analysis and research that have recently become possible owing to the availability of data from systems integrated in real pedagogical situations. However, compared to other applications of EDM, affect prediction from data introduces additional challenges beyond those faced when investigating, for example, the effects of the interaction on learning.
history of a tutorial interaction into account, but further investigation reveals that even
the antecedent values of reported factors also play a role in tutors’ diagnosis—even for the
same sequence of events, tutors’ actions (and their subsequent verbalizations and reports)
are affected by values reported earlier for the same factor [46].
A possible solution that is emerging is to compare, aggregate and consolidate models
developed from different sources of data (e.g., self-reports and peer or tutor reports). On
the one hand, learners are a more valid source of evidence than tutors for reporting their own affective states, such as their level of confidence. On the other hand, tutors may be
better suited than the learners to judge learners’ boredom or effort as well as to report on
how such judgments can be used to support learning. Actor–observer biases may also play
an important role in the judgments of fuzzy, vague, ill-defined constructs such as affective
states. Learners might provide one set of categories by attributing their states to situational
factors, while observers (peers and trained judges) might make attributions to stable dis-
positional factors, thereby obtaining an alternate set of categories [58].
Despite these qualifications, examples of how to derive models from different sources of data can be found in [6], where different branches of decision trees are manually aggregated. Similar examples appear in [50]. If done using automatic methods, this approach has the potential to increase the precision of the models generated. The issue of automatically aggregating models has been investigated in detail in the field of data mining [59,60]. In addition, the need to consider and merge different perspectives resembles the emerging requirements behind reconciling models in the field of ontologies, e.g., [61,62]. Insights into how this could be achieved appear in [63]. A particularly relevant
example is the work presented in [64], where a user’s and an expert’s conceptual model are
compared. Developing formal ways to perform such measurements is necessary to enable
the reduction of the bias introduced by researchers’ intuitions.
16.6 Conclusions
As illustrated by the case studies, monitoring students’ interaction parameters can provide
a cost-effective, nonintrusive, efficient, and effective method to automatically detect com-
plex phenomena such as the affective states that accompany learning. However, several
methodological issues emerged and were discussed in this chapter. One important aspect
to consider was the context of data collection and the inevitable interference created by
the collection of affective labels. Another important issue that was flagged is students' familiar-
ity with the learning environment and their goals when using it. Finally, two methods of
monitoring affective states were described and employed in the case studies: (a) real-time
measurements by means of emote-aloud protocols or observations and (b) retrospective affect judgments by participants and/or tutors.
Although the studies used specific systems and measured certain affective states, the
methodology for data collection and machine-learning methods employed are generaliz-
able and could serve as guidance for the design of other studies. However, several impor-
tant questions still remain. There is the question of how these features can be coupled with bodily measures such as facial features, speech contours, and body language, as
well as physiological measures such as galvanic skin response, heart rate, respiration rate,
etc. Detection accuracies could be increased by implementing hybrid models and con-
solidating their outputs. There is also the question of how computer tutors might adapt
Acknowledgments
D’Mello and Graesser would like to acknowledge the National Science Foundation (REC
0106965, ITR 0325428, HCC 0834847) for funding this research. Any opinions, findings and
conclusions, or recommendations expressed in this chapter are those of the authors and do
not necessarily reflect the views of NSF.
References
1. Arroyo, I., Cooper, D., Burleson, W., Woolf, B., Muldner, K., and Christopherson, R., Emotion
sensors go to school, in Dimitrova, V., Mizoguchi, R., du Boulay, B., and Graesser, A. (eds.),
Proceedings of the 14th International Conference on Artificial Intelligence in Education: Building
Learning Systems that Care: From Knowledge Representation to Affective Modelling, Vol. 200,
Brighton, U.K., IOS Press, Amsterdam, the Netherlands, 2009, pp. 17–24.
2. Forbes-Riley, K., Rotaru, M., and Litman, D., The relative impact of student affect on perfor-
mance models in a spoken dialogue tutoring system, User Modeling and User-Adapted Interaction
18(1), 11–43, 2008.
3. Conati, C. and Maclaren, H., Empirically building and evaluating a probabilistic model of user
affect, User Modeling and User-Adapted Interaction 19(3), 267–303, 2009.
4. Robison, J., McQuiggan, S., and Lester, J., Evaluating the consequences of affective feedback
in intelligent tutoring systems, in International Conference on Affective Computing and Intelligent
Interaction, Amsterdam, the Netherlands, 2009, pp. 1–6.
5. D’Mello, S., Craig, S., Fike, K., Graesser, A., and Jacko, J., Responding to learners’ cognitive-
affective states with supportive and shakeup dialogues, Human-Computer Interaction: Ambient,
Ubiquitous and Intelligent Interaction, Springer, Berlin/ Heidelberg, Germany, 2009, pp. 595–604.
6. Porayska-Pomsta, K., Mavrikis, M., and Pain, H., Diagnosing and acting on student affect: The
tutor’s perspective, User Modeling and User-Adapted Interaction 18(1), 125–173, 2008.
7. Lepper, M. R., Woolverton, M., Mumme, D. L., Gurtner, J., Lajoie, S. P., and Derry, S. J.,
Motivational techniques of expert human tutors: Lessons for the design of computer-based
tutors, Computers as Cognitive Tools, Lawrence Erlbaum Associates, Hillsdale, NJ, 1993,
pp. 75–107.
8. Goleman, D., Emotional Intelligence: Why It Can Matter More than IQ, Bloomsbury, London, U.K.,
1996.
9. Conati, C., Marsella, S., and Paiva, A., Affective interactions: The computer in the affective
loop, in Proceedings of the 10th International Conference on Intelligent User Interfaces, ACM Press,
San Diego, CA, 2005, p. 7.
10. Conati, C., Probabilistic assessment of user’s emotions in educational games, Journal of Applied
Artificial Intelligence 16(7–8), 555–575, 2002.
11. Carver, C., Negative affects deriving from the behavioral approach system, Emotion 4(1), 3–22,
2004.
12. Deci, E. L., Ryan, R. M., and Aronson, J., The paradox of achievement: The harder you push,
the worse it gets, Improving Academic Achievement: Impact of Psychological Factors on Education,
Academic Press, Orlando, FL, 2002, pp. 61–87.
13. Dweck, C. S. and Aronson, J., Messages that motivate: How praise molds students’ beliefs,
motivation, and performance (in surprising ways), in Aronson, J. (ed.), Improving Academic
Achievement: Impact of Psychological Factors on Education, Academic Press, New York, 2002.
14. Stein, N. L., Hernandez, M. W., Trabasso, T., Lewis, M., Haviland-Jones, J. M., and Barrett, L. F.,
Advances in modeling emotions and thought: The importance of developmental, online, and
multilevel analysis, Handbook of Emotions, Guilford Press, New York, 2008, pp. 574–586.
15. Keller, J. M. and Reigeluth, C. M., Motivational design of instruction, Instructional-Design
Theories and Models: An Overview of Their Current Status, Lawrence Erlbaum Associates Hillsdale,
NJ, 1983, pp. 383–434.
16. Ames, C., Classrooms: Goals, structures, and student motivation, Journal of Educational
Psychology 84(3), 261–271, 1992.
17. Rosiek, J., Emotional scaffolding: An exploration of the teacher knowledge at the intersection of
student emotion and the subject matter, Journal of Teacher Education 54(5), 399–412, 2003.
18. Qu, L., Wang, N., and Johnson, L., Using learner focus of attention to detect learner motivation
factors, in Proceedings of the User Modeling Conference 2005, Edinburgh, U.K., 2005, pp. 70–73.
19. Beck, J., Engagement tracing: Using response times to model student disengagement, in
Proceedings of the 2005 conference on Artificial Intelligence in Education: Supporting Learning through
Intelligent and Socially Informed Technology, Amsterdam, the Netherlands, 2005, pp. 88–95.
20. Johns, J. and Woolf, P., A dynamic mixture model to detect student motivation and proficiency,
in AAAI, Boston, MA, pp. 163–168, 2006.
21. de Baker, R. S. J., Corbett, A., Roll, I., and Koedinger, K., Developing a generalizable detector of
when students game the system, User Modeling and User-Adapted Interaction 18(3), 287–314, 2008.
22. Walonoski, J. and Heffernan, N., Detection and analysis of off-task gaming behavior in intelli-
gent tutoring systems, in Proceedings of the 8th Conference on Intelligent Tutoring Systems, Jhongli,
Taiwan, 2006, pp. 382–391.
23. Arroyo, I. and Woolf, B., Inferring learning and attitudes with a Bayesian Network of log files
data, in Looi, C. K., McCalla, G., Bredeweg, B., and Breuker, J (eds.) Proceedings of the Artificial
Intelligence in Education: Supporting Learning through Intelligent and Socially Informed Technology
(AIED-2005 Conference), July 18–22, 2005, Amsterdam, the Netherlands, IOS Press, Amsterdam,
the Netherlands, 2005, pp. 33–40.
24. Cocea, M. and Weibelzahl, S., Eliciting motivation knowledge from log files towards motiva-
tion diagnosis for adaptive systems, in Proceedings of the 11th International Conference on User
Modeling 2007, Corfu, Greece, 2007, pp. 197–206.
25. Picard, R.W. and Scheirer, J., The galvactivator: A glove that senses and communicates skin
conductivity, in Proceedings of the 9th International Conference on HCI, New Orleans, LA,
pp. 1538–1542, 2001.
26. Pantic, M. and Rothkrantz, L., Toward an affect-sensitive multimodal human-computer inter-
action, Proceedings of the IEEE 91(9), 1370–1390, 2003.
27. Zeng, Z., Pantic, M., Roisman, G., and Huang, T., A survey of affect recognition methods:
Audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine
Intelligence 31(1), 39–58, 2009.
28. Kapoor, A., Picard, R. W., and Ivanov, Y., Probabilistic combination of multiple modalities
to detect interest, in International Conference on Pattern Recognition, Cambridge, U.K., 2004,
pp. 969–972.
29. Messom, C. H., Sarrafzadeh, A., Johnson, M. J., and Chao, F., Affective state estimation from
facial images using neural networks and fuzzy logic, in Wang, D. and Lee, N. K. (eds.), Neural
Networks Applications in Information Technology and Web Engineering, Borneo Publications,
2005.
30. Litman, D. J., Recognizing student emotions and attitudes on the basis of utterances in spoken
tutoring dialogues with both human and computer tutors, Speech Communication 28(5), 559–590,
2006.
31. D’Mello, S., Graesser, A., and Picard, R. W., Toward an affect-sensitive AutoTutor, Intelligent
Systems, IEEE 22(4), 53–61, 2007.
32. Ang, J., Dhillon, R., Krupski, A., Shriberg, E., and Stolcke, A., Prosody-based automatic detec-
tion of annoyance and frustration in human-computer dialog, in Proceedings of the International
Conference on Spoken Language Processing, Vol 3, Denver, CO, 2002, pp. 2037–2039.
33. Forbes-Riley, K. and Litman, D. J., Predicting emotion in spoken dialogue from multiple knowl-
edge sources, in Proceedings of Human Language Technology Conference of the North American Chapter
of the Association for Computational Linguistics (HLT/NAACL), Boston, MA, 2004, pp. 201–208.
34. Liscombe, J., Riccardi, G., and Hakkani-Tür, D., Using context to improve emotion detection
in spoken dialog systems, in Ninth European Conference on Speech Communication and Technology
(EUROSPEECH’05), Lisbon, Portugal, 2005, pp. 1845–1848.
35. Barrett, F., Are emotions natural kinds?, Perspectives on Psychological Science 1, 28–58, 2006.
36. Aviezer, H., Ran, H., Ryan, J., Grady, C., Susskind, J. M., Anderson, A. K., and Moscovitch, M.,
Angry, disgusted or afraid? Studies on the malleability of facial expression perception, Psychological Science 19(7), 724–732, 2008.
37. Russell, J. A., Core affect and the psychological construction of emotion, Psychological Review
110(1), 145–172, 2003.
38. Stemmler, G., Heldmann, M., Pauls, C. A., and Scherer, T., Constraints for emotion specificity in
fear and anger: The context counts, Psychophysiology 38(2), 275–291, 2001.
39. D’Mello, S. K., Craig, S. D., Sullins, C. J., and Graesser, A. C., Predicting affective states expressed
through an emote-aloud procedure from AutoTutor’s mixed initiative dialogue, International
Journal of Artificial Intelligence in Education 16(1), 3–28, 2006.
40. Graesser, A. C., Chipman, P., Haynes, B. C., and Olney, A., AutoTutor: An intelligent tutoring
system with mixed-initiative dialogue, IEEE Transactions on Education 48(4), 612–618, 2005.
41. Ericsson, K. A. and Simon, H. A., Protocol Analysis: Verbal Reports as Data, MIT Press, Cambridge,
MA, 1993.
42. Trabasso, T. and Magliano, J. P., Conscious understanding during comprehension, Discourse
Processes 21(3), 255–287, 1996.
43. Baker, R., Rodrigo, M., and Xolocotzin, U., The dynamics of affective transitions in simulation
problem-solving environments, in Paiva, A., Prada, R., and Picard, R. W. (eds.), 2nd International Conference on Affective Computing and Intelligent Interaction (ACII 2007), Lisbon, Portugal, 2007, pp. 666–677.
44. Conati, C., Chabbal, R., and Maclaren, H., A Study on using biometric sensors for monitoring
user emotions in educational games, in Workshop on Assessing and Adapting to User Attitudes and
Affect: Why, When and How? in conjunction with User Modeling (UM-03), Johnstown, PA, 2003.
45. Mavrikis, M., Maciocia, A., and Lee, J., Towards predictive modelling of student affect from
web-based interactions, in Luckin, R., Koedinger, K., and Greer, J. (eds.), Proceedings of the
13th International Conference on Artificial Intelligence in Education: Building Technology Rich
Learning Contexts that Work (AIED2007), Vol. 158, Los Angeles, CA, IOS Press, Amsterdam, the
Netherlands, 2007, pp. 169–176.
46. Mavrikis, M., Modelling students’ behaviour and affective states in ILEs through educational
data mining, PhD thesis, The University of Edinburgh, Edinburgh, U.K., 2008.
47. Dimitrova, V., Mizoguchi, R., du Boulay, B., and Graesser, A. (eds.), Proceedings of the 14th
International Conference on Artificial Intelligence in Education Building Learning Systems that Care:
From Knowledge Representation to Affective Modelling (AIED 2009), Vol. 200, July 6–10, 2009,
Brighton, U.K., IOS Press, Amsterdam, the Netherlands, 2009.
48. D’Mello, S., Craig, S., and Graesser, A., Multimethod assessment of affective experience and
expression during deep learning, International Journal of Learning Technology 4(3/4), 165–187, 2009.
49. Baker, R. S., Corbett, A. T., Koedinger, K. R., and Wagner, A. Z., Off-task behavior in the cognitive
tutor classroom: When students “game the system,” in Proceedings of ACM CHI 2004: Computer-
Human Interaction, Vienna, Austria, 2004, pp. 383–390.
50. Graesser, A. C., McDaniel, B., Chipman, P., Witherspoon, A., D’Mello, S., and Gholson, B.,
Detection of emotions during learning with AutoTutor, in Proceedings of the 28th Annual
Conference of the Cognitive Science Society, Mahwah, NJ, 2006, pp. 285–290.
51. Landauer, T. and Dumais, S., A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review 104(2), 211–240, 1997.
52. D’Mello, S., Craig, S., Witherspoon, A., McDaniel, B., and Graesser, A., Automatic detection of
learner’s affect from conversational cues, User Modeling and User-Adapted Interaction 18(1–2),
45–80, 2008.
53. Mavrikis, M. and Maciocia, A., WALLIS: A Web-based ILE for science and engineering students
studying mathematics, in Workshop of Advanced Technologies for Mathematics Education in 11th
International Conference on Artificial Intelligence in Education, Sydney, Australia, 2003.
54. Lohr, S., Sampling: Design and Analysis, Duxbury Press, Pacific Grove, CA, 1999.
55. Witten, I. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, Morgan
Kaufmann, San Francisco, CA, 2005.
56. D’Mello, S., Taylor, R., Davidson, K., and Graesser, A., Self versus teacher judgments of learner
emotions during a tutoring session with AutoTutor, in Woolf, B. P., Aimeur, E., Nkambou, R.,
and Lajoie, S. (eds.) Proceedings of the 9th International Conference on Intelligent Tutoring Systems,
Montreal, Canada, 2008, pp. 9–18.
57. Porayska-Pomsta, K., Influence of situational context on language production: Modelling
teachers’ corrective responses, PhD thesis, School of Informatics, The University of Edinburgh,
Edinburgh, U.K., 2003.
58. Jones, E. and Nisbett, R., The Actor and the Observer: Divergent Perceptions of the Causes of Behavior,
General Learning Press, New York, 1971.
59. Williams, G. J., Inducing and Combining Decision Structures for Expert Systems, PhD thesis, The
Australian National University, Canberra, Australia, 1990.
60. Vannoorenberghe, P., On aggregating belief decision trees, Information Fusion 5(3), 179–188,
2004.
61. Ehrig, M. and Sure, Y., Ontology mapping—An integrated approach, in European Semantic Web
Symposium (ESWS), Heraklion, Greece, 2004, pp. 76–91.
62. Klein, M., Combining and relating ontologies: An analysis of problems and solutions, in Perez,
G., Gruninger, M., Stuckenschmidt, H., and Uschold, M. (eds.), Workshop on Ontologies and
Information Sharing (IJCAI’01), Seattle, WA, 2001.
63. Agarwal, P., Huang, Y., and Dimitrova, V., Formal approach to reconciliation of individual
ontologies for personalisation of geospatial semantic web, in Rodriguez, M.A (ed.), Proceedings
of GeoS 2005, LNCS 3799, Mexico, 2005, pp. 195–210.
64. Arroyo, A., Denaux, R., Dimitrova, V., and Pye, M., Interactive ontology-based user knowl-
edge acquisition: A case study, in Sure, Y. and Dominguez, J. (eds.), Semantic Web: Research
and Applications, Proceedings of the Third European Semantic Web Conference (ESWC 2006), Budva,
Montenegro, 2006, pp. 560–574.
17
Measuring Correlation of Strong Symmetric
Association Rules in Educational Data
Contents
17.1 Introduction  245
17.1.1 Association Rules Obtained with Logic-ITA  246
17.1.1.1 Association Rules and Associated Concepts  246
17.1.1.2 Data from Logic-ITA  247
17.1.1.3 Association Rules Obtained with Logic-ITA  248
17.2 Measuring Interestingness  249
17.2.1 Some Measures of Interestingness  249
17.2.2 How These Measures Perform on Our Datasets  251
17.2.3 Contrast Rules  253
17.2.4 Pedagogical Use of the Association Rules  254
17.3 Conclusions  254
References  255
17.1 Introduction
Association rules are very useful in educational data mining since they extract associa-
tions between educational items and present the results in an intuitive form to the teach-
ers. They can be used for a range of purposes: in [1,2], they are used, combined with other methods, to personalize recommendations to students while browsing the Web. In [15], they
were used to find various associations of students’ behavior in the Web-based educational
system LON-CAPA. The work in [4] used fuzzy rules in a personalized e-learning mate-
rial recommender system to discover associations between students’ requirements and
learning materials. They were used in [5] to find mistakes often made together while stu-
dents solve exercises in propositional logic, in order to provide proactive feedback and
understand underlying learning difficulties. In [6], they were combined with genetic pro-
gramming to discover relations between knowledge levels, times, and scores that help the
teacher modify the course’s original structure and content.
Like any other data mining technique, association rules require that one understands
the data well: What is the content and the structure of the data? Does it need cleaning?
Does it need transformation? Which attributes and values should be used to extract the
association rules? Do the results, here the association rules, make sense or, in other words,
can we rate them? Can the results be deployed to improve teaching and learning? Unlike
other data mining techniques, there is mainly one algorithm to extract association rules
from data (Apriori [7]). In comparison with a classification task, for example, there are
many classifiers that, with the same set of data, use different algorithms and thus can give
different results [3].
A pitfall in applying association rules regards the selection of “interesting” rules among
all extracted rules. Let us clarify that the level of interestingness of a rule as measured by
an objective measure describes how the associated items are correlated, not how useful the
rule ends up being for the task. Indeed, the algorithm, depending on the thresholds given
for the support and the confidence, which we will describe in the next section, can extract a large number of rules, and the task of filtering the meaningful ones can be arduous. This is
a common concern for which a range of objective measures exist, depending on the context
[8,9]. However, these measures can give conflicting results about whether a rule should be
retained or discarded. In such cases, how do we know which measure to follow?
We explore in this chapter several objective measures in the context of our data in order
to understand which are better suited for it. We extracted association rules from the data
stored by the Logic-ITA, an intelligent tutoring system for formal proof in propositional
logic [10]. Our aim was to find out whether some mistakes often occurred together or
sequentially during practice. The results gave strong symmetric associations between three
mistakes. Strong means that all associations had a strong support and a strong confidence.
Symmetric means that X → Y and Y → X were both associations extracted. Puzzlingly, mea-
sures of interestingness such as lift, correlation, or chi-square indicated poor or no corre-
lation. However cosine, Jaccard, and all-confidence were systematically high, implying a
high correlation between the mistakes. In this chapter, we investigate why measures such
as lift, correlation, and chi-square, to some extent, work poorly with our data, and we show
that our data has quite a special shape. Further, chi-square on larger datasets with the
same properties gives an interesting perspective on our rules. We also show that cosine,
Jaccard, and all-confidence rate our rules as interesting. These latter measures, in contrast to the former ones, have the null-invariant property. This means that they are not sensitive to transactions that contain neither X nor Y. This fact is relevant when
items are not symmetric, in the sense that the information conveyed by the presence of X
is more important than the information conveyed by its absence, which is the case with the
data of Logic-ITA and quite often the case with educational data in general. A preliminary
version of this work has appeared in [11].
An association rule is a rule of the form X → Y, where X and Y are disjoint subsets of I hav-
ing a support and a confidence above a minimum threshold.
Support:

    sup(X → Y) = |{ti : X, Y ∈ ti}| / n,

where |A| denotes the cardinality of the set A and n is the total number of transactions. In other words, the support of a rule X → Y is the proportion of transactions that contain both X and Y. This is also called P(X, Y), the probability that a transaction contains both X and Y. Support is symmetric: sup(X → Y) = sup(Y → X).
Confidence:

    conf(X → Y) = |{ti : X, Y ∈ ti}| / |{ti : X ∈ ti}|.

In other words, the confidence of a rule X → Y is the proportion of transactions that contain both X and Y among those that contain X. An equivalent definition is

    conf(X → Y) = P(X, Y) / P(X),  with  P(X) = |{ti : X ∈ ti}| / n;

the confidence is thus the probability that a transaction contains Y knowing that it already contains X. Confidence is not symmetric. Usually conf(X → Y) is different from conf(Y → X).
Each of the two above measures plays a role in the construction of the rules. Support
makes sure that only items occurring often enough in the data will be taken into account
to establish the association rules. Confidence makes sure that the occurrence of X implies
in some sense the occurrence of Y.
Symmetric association rule: We call a rule X → Y a symmetric association rule if sup(X → Y) is above a given minimum threshold and both conf(X → Y) and conf(Y → X) are above a given minimum threshold. This is the kind of association rules we obtained with the Logic-ITA.
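As a minimal illustration of these definitions, the short sketch below computes support, confidence, and the symmetric-rule test over a toy list of transactions; the transactions, item codes, and thresholds are invented for illustration and are not the Logic-ITA data.

```python
transactions = [{"M10", "M11", "M12"}, {"M11", "M12"}, {"M10"}, {"M10", "M11", "M12"}]
n = len(transactions)

def support(x, y):
    """Proportion of transactions that contain every item of X and every item of Y."""
    return sum(1 for t in transactions if x <= t and y <= t) / n

def confidence(x, y):
    """Proportion of transactions containing X that also contain Y: P(X, Y) / P(X)."""
    return support(x, y) / (sum(1 for t in transactions if x <= t) / n)

def is_symmetric_rule(x, y, min_sup=0.5, min_conf=0.7):
    """X -> Y is a symmetric rule if the support and both confidences clear the thresholds."""
    return (support(x, y) >= min_sup and
            confidence(x, y) >= min_conf and
            confidence(y, x) >= min_conf)

print(support({"M11"}, {"M12"}), confidence({"M11"}, {"M12"}),
      is_symmetric_rule({"M11"}, {"M12"}))
```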
premises. For this, the student has to construct new formulas, step by step, using logic
rules and formulas previously established in the proof, until the conclusion is derived.
There is no unique solution and any valid path is acceptable. Steps are checked on the fly
and, if incorrect, an error message and possibly a tip are displayed. Students used the tool
at their own discretion. A consequence is that there is neither a fixed number nor a fixed
set of exercises done by all students.
All steps, whether correct or not, are stored for each user and each attempted exercise, as well as the associated error messages. A very interesting task was to analyze the mistakes made and
try to detect associations within them. This is why we used association rules. We defined
the set of items I as the set of possible mistakes or error messages and a transaction as the
set of mistakes made by one student on one exercise. Therefore, we obtain as many trans-
actions as exercises attempted with the Logic-ITA during the semester, which is about
2000. Data did not need to be cleaned but had to be put in the proper form to extract association
rules: a file containing a list of transactions, i.e., a list of mistakes per attempted exercise,
was created from the database stored by Logic-ITA.
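A sketch of this preparation step is given below; the file names and the column layout of the stored log (one row per mistake, with student, exercise, and mistake columns) are assumptions for illustration, not the actual Logic-ITA schema.

```python
import csv
from collections import defaultdict

# Group the logged mistakes into one transaction per (student, exercise) attempt.
transactions = defaultdict(set)
with open("logic_ita_mistakes.csv", newline="") as f:
    for row in csv.DictReader(f):               # assumed columns: student, exercise, mistake
        transactions[(row["student"], row["exercise"])].add(row["mistake"])

# Write one line per attempted exercise, listing the mistakes made in it.
with open("transactions.txt", "w") as out:
    for mistakes in transactions.values():
        out.write(" ".join(sorted(mistakes)) + "\n")
```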
TABLE 17.1
Some Association Rules for Year 2004
M11 ==> M12 [sup: 77%, conf: 89%]
M12 ==> M11 [sup: 77%, conf: 87%]
M11 ==> M10 [sup: 74%, conf: 86%]
M10 ==> M11 [sup: 78%, conf: 93%]
M12 ==> M10 [sup: 78%, conf: 89%]
M10 ==> M12 [sup: 74%, conf: 88%]
Mistake codes: M10: Premise set incorrect; M11: Rule can be applied, but deduction incorrect; M12: Wrong number of line references given.
• lift(X → Y) = conf(X → Y)/P(Y). An equivalent definition is P(X, Y)/(P(X)P(Y)).
Lift is a symmetric measure for rules of the form X → Y. A lift well above 1 indicates a strong correlation between X and Y. A lift around 1 says that P(X, Y) = P(X)P(Y). In terms of probability, this means that the occurrence of X and the occurrence of Y in the same transaction are independent events, hence X and Y are not correlated. Lift(X → Y) can be seen as the summary of AV(X → Y) and AV(Y → X), where AV is another objective measure called Added Value [13].
• Correlation(X → Y) = (P(X, Y) − P(X)P(Y)) / sqrt(P(X)P(Y)(1 − P(X))(1 − P(Y))).
Correlation is a symmetric measure and is a straightforward application of Pearson correlation to association rules when X and Y are interpreted as vectors: X (respectively Y) is a vector of dimension n; coordinate i takes value 1 if transaction ti contains X (respectively Y) and takes value 0 otherwise. A correlation around 0 indicates that X and Y are not correlated, a negative figure indicates that X and Y are negatively correlated, and a value close to 1 that they are positively correlated. Note that the denominator of the division is positive and smaller than 1. Thus, the absolute value |cor(X → Y)| is greater than |P(X, Y) − P(X)P(Y)|. In other words, if the lift is around 1, correlation can still be significantly different from 0.
• Chi-square. To perform the chi-square test, a table of expected frequencies is first calculated using P(X) and P(Y) from the contingency table. A contingency table summarizes the number of transactions that contain X and Y, X but not Y, Y but not X, and finally that contain neither X nor Y. The expected frequency for (X ∩ Y) is given by the product nP(X)P(Y). Performing a grand total over observed frequencies versus expected frequencies gives a number, which we denote by chi. Consider the contingency table shown in Table 17.2. The X column reads as follows: 500 transactions contain X and Y, while 50 transactions contain X but not Y. Altogether there are 550 transactions that contain X and the total number of transactions is 2000. The following column reads similarly for the transactions that do not contain X.

TABLE 17.2
A Contingency Table
         X      ¬X     Total
Y        500    50     550
¬Y       50     1400   1450
Total    550    1450   2000

Here P(X) = P(Y) = 550/2000. Therefore the expected frequency (Xe ∩ Ye) is (550 × 550)/2000 = 151.25; the other expected frequencies are computed similarly (Table 17.3).

TABLE 17.3
Expected Frequencies for Table 17.2
         Xe       ¬Xe       Total
Ye       151.25   398.75    550
¬Ye      398.75   1051.25   1450
Total    550      1450      2000
The obtained number chi is compared with a cut-off value read from a chi-square
table. For the probability value of 0.05 with one degree of freedom, the cut-off
value is 3.84. If chi is greater than 3.84, then X and Y are regarded as correlated
with a 95% confidence level. Otherwise they are regarded as noncorrelated
also with a 95% confidence level. Therefore, in our example, X and Y are highly
correlated.
• Cosine(X → Y) = P(X, Y) / sqrt(P(X)P(Y)).
An equivalent definition is

    cosine(X → Y) = |{ti : X, Y ∈ ti}| / sqrt(|{ti : X ∈ ti}| × |{ti : Y ∈ ti}|).

• Jaccard(X → Y) = |X, Y| / (|X| + |Y| − |X, Y|),
where
|X, Y| is the number of transactions that contain both X and Y
|X| is the number of transactions that contain X
|Y| is the number of transactions that contain Y
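The sketch below gathers the measures discussed in this section into a single helper that works directly from a 2-by-2 contingency table. The function name and argument order are illustrative; all-confidence, which appears in the tables below but is not defined explicitly above, is computed in its usual form, sup(X, Y)/max(sup(X), sup(Y)).

```python
from math import sqrt

def measures(n11, n10, n01, n00):
    """Interestingness measures from the counts |X,Y|, |X,~Y|, |~X,Y|, |~X,~Y|.
    Degenerate tables (e.g., S6 below, where P(X) = P(Y) = 1) are not handled."""
    n = n11 + n10 + n01 + n00
    px, py, pxy = (n11 + n10) / n, (n11 + n01) / n, n11 / n
    chi = 0.0
    for obs, rx, ry in [(n11, px, py), (n10, px, 1 - py),
                        (n01, 1 - px, py), (n00, 1 - px, 1 - py)]:
        expected = n * rx * ry                    # expected frequency under independence
        chi += (obs - expected) ** 2 / expected
    return {
        "sup": pxy,
        "conf_xy": pxy / px,
        "conf_yx": pxy / py,
        "lift": pxy / (px * py),
        "corr": (pxy - px * py) / sqrt(px * py * (1 - px) * (1 - py)),
        "chi": chi,
        "cosine": pxy / sqrt(px * py),
        "jaccard": n11 / (n11 + n10 + n01),
        "all_conf": pxy / max(px, py),
    }

# S1 from Table 17.4; the values come out close to those reported in Table 17.5
# (lift about 3.3, correlation about 0.87, cosine about 0.91, Jaccard about 0.83).
print(measures(500, 50, 50, 1400))
```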
TABLE 17.4
Contingency Tables Giving Symmetric Rules with Strong Confidence
       X, Y      X, ¬Y    ¬X, Y    ¬X, ¬Y
S1     500       50       50       1,400
S2     1,340     300      300      60
S3     1,340     270      330      60
S4     1,340     200      400      60
S5     1,340     0        0        660
S6     2,000     0        0        0
S7     13,400    3,000    3,000    600
S8     13,400    2,700    3,300    600
S9     13,400    2,000    4,000    600
(The pictorial representation column of the original table is not reproduced.)
solutions in which all mistakes from set X were made but no mistake from set Y, and so
on. For the set S3, for example, 1340 solutions contain both all mistakes from set X and all
mistakes from set Y, 270 contain all mistakes from set X but no mistake from Y, 330 contain
all mistakes from Y but no mistakes from X, and 60 attempted solutions contain neither
mistakes from X nor mistakes from Y. The last three lines, S7 to S9, are the same as S2 to S4
with a multiplying factor of 10. To help visualize the differences in distribution between
the first six datasets, we included a pictorial representation of the distributions in the last
column. The last three datasets have the same distribution as S2, S3, and S4, and they are
not shown again.
TABLE 17.5
Measures for All Contingency Tables
      sup    conf(X→Y)/conf(Y→X)   lift   Corr    Chi       cos    Jac.   All-c.
S1    0.25   0.90                  3.31   0.87    1522.88   0.91   0.83   0.91
S2    0.67   0.82/0.82             1.00   −0.02   0.53      0.82   0.69   0.82
S3    0.67   0.83/0.82             1.00   −0.01   0.44      0.82   0.69   0.80
S4    0.67   0.87/0.77             1.00   0       0.00      0.82   0.69   0.77
S5    0.67   1.00/1.00             1.49   1       2000      1      1      1
S6    1.00   1.00/1.00             1.00   —       —         1      1      1
S7    0.67   0.82/0.82             1.00   −0.02   5.29      0.82   0.69   0.82
S8    0.67   0.83/0.80             1.00   −0.01   4.37      0.82   0.69   0.80
S9    0.67   0.87/0.77             1.00   0       0.01      0.82   0.69   0.77

For each of these datasets, we calculated the various measures of interestingness we exposed earlier. Results are shown in Table 17.5. Expected frequencies are calculated assuming the independence of X and Y. Note that expected frequencies coincide with observed frequencies for S6, so chi-square cannot be calculated. We have put in bold the
results that indicate a positive dependency between X and Y. We have also highlighted the
lines for S3 and S4, representing our data from the Logic-ITA and, in a lighter shade, S8
and S9, which have the same characteristics but with a multiplying factor of 10.
Our results are aligned with the ones of [8] for lift, cosine, and Jaccard: these measures
confirm that the occurrence of X implies the occurrence of Y, as seen in the last three
columns of Table 17.5.
However they disagree for correlation, as shown in column “Corr” of Table 17.5. Except
for S1 and S5, the correlation measure indicates a poor relationship between X and Y. Note
that P(X) and P(Y) are positive numbers smaller than 1, hence their product is smaller than both P(X) and P(Y). If P(X, Y) is likewise noticeably smaller than P(X) and P(Y), it can end up close to the product P(X)P(Y); the difference between P(X)P(Y) and P(X, Y) is then very small and, as a result, correlation is around 0.
This is exactly what happens with our data, and this fact leads to a strong difference with
[8]’s E1, E2, and E3 sets, where the correlation was highly ranked: except for S1 and S5, our
correlation results are around 0.
Finally, chi-square and all-confidence are not considered in [8]. It is well known that
chi-square is not invariant under the row-column scaling property, as opposed to all the
other measures that yield the same results as shown for S2 and S7, S3 and S8, and S4 and
S9. Chi-square rates X and Y as independent for S2 and S3, but rates them as dependent in
S7 and S8. As the numbers increase, the chi-square finds increasing dependency between
the variables. Due to a change in the curriculum, we were not able to collect and mine
association rules over more years. However, one can make the following projection: with a
similar trend over a few more years, one would obtain sets similar to S8 and S9. Chi-square would rate X and Y as correlated when X and Y are symmetric enough, as for S3 and S8. All-confidence always rates X and Y as correlated, as cosine and Jaccard do. These three
measures have the null-invariant property. These measures are particularly well-suited
for nonsymmetric items, nonsymmetric in the sense that it is more important to be
aware of the presence of item X than of its absence. This is actually the case of the asso-
ciation rules obtained with the Logic-ITA. We are looking for information concerning
the occurrence of mistakes, not for their nonoccurrence. Therefore, these measures are
better suited to our data than the lift, for example, and the rules should be interpreted
accordingly.
These rules give complementary information that allows us to better judge the dependency of X and Y. They tell us that, of the attempted solutions not containing mistake X, 85% contain mistake Y, while of the attempted solutions containing mistake X, only 17% do not contain mistake Y. Furthermore, only 3% of the attempted solutions contain neither mistake X nor mistake Y. The neighborhood {¬Y → X, Y → ¬X, ¬Y → ¬X} behaves similarly, supporting the hypothesis that X and Y are positively correlated.
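A small sketch of such complement-based confidences is given below, again computed from a 2-by-2 contingency table. The counts used in the example are those of set S3 from Table 17.4, which happen to give values close to the percentages quoted above; they are not the actual Logic-ITA counts, which are not reproduced here.

```python
def contrast_confidences(n11, n10, n01, n00):
    """Confidences of the complement-based rules discussed above."""
    n = n11 + n10 + n01 + n00
    return {
        "conf(~X -> Y)":  n01 / (n01 + n00),   # mistake Y among attempts without mistake X
        "conf(X -> ~Y)":  n10 / (n11 + n10),   # no mistake Y among attempts with mistake X
        "conf(~X -> ~Y)": n00 / (n01 + n00),   # neither mistake, among attempts without X
        "P(~X, ~Y)":      n00 / n,             # attempts containing neither mistake
    }

print(contrast_confidences(1340, 270, 330, 60))
```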
17.3 Conclusions
In this chapter, we investigated the interestingness of the association rules found in the
data from the Logic-ITA, an intelligent tutoring system for propositional logic. We used
this data-mining technique to look for mistakes often made together while solving an
exercise, and found strong rules associating three specific mistakes.
Taking an inquisitive look at our data, it turns out that it has quite a special shape. First,
it gives strong symmetric association rules. Second, P(X) and P(Y), the proportion of exercises where mistake X was made and the proportion of exercises where mistake Y was made, respectively, are significantly higher than P(X, Y), the proportion of exercises where both mistakes were made. A consequence is that many interestingness measures resting on probabilities or statistics, such as lift, correlation, or, to a certain extent, even chi-square, rate X and Y as noncorrelated. However, cosine, Jaccard, or all-confidence, which have the
null-invariant property and thus are not sensitive to transactions containing neither X nor
Y, rate X and Y as positively correlated. Further, we observe that mining associations on
data cumulated over several years could lead to a positive correlation with the chi-square
test. Finally, contrast rules give interesting complementary information: rules not contain-
ing any mistake or making only one mistake are very weak. This is further investigated
in [18]. The use of these rules to change parts of our course seemed to contribute to better learning, as we have observed an increase in the marks in the final exam as well as an increase in the number of completely finished exercises with the Logic-ITA.
This really indicates that the notion of interestingness is very sensitive to the context.
Since educational data often has a relatively small number of instances, measures based on
statistical correlation may have to be handled with care for this domain.
We come to a similar conclusion as in [19]: the interestingness of a rule should be first
measured by measures with the null-invariant property such as cosine, Jaccard, or all-
confidence, then with a measure from the probability field such as lift if the first ones
rated the rule as uninteresting. In case of conflict between the two types of measures, the
user needs to take into account the intuitive information provided by each measure and
decide upon it. In particular, if knowing the presence of X is more important than knowing
its absence, then it is best to follow an interestingness measure having the null-invariant
property like cosine; if not, then it is better to follow an interestingness measure based on
statistics like lift.
As a further thought, in an educational context, is it important to consider only objective
interestingness measures to filter associations? When the rule X → Y is found, the prag-
matically oriented teacher will first look at the support: in our case, it showed that over
60% of the exercises contained at least three different mistakes. This is a good reason to
ponder. The analysis of whether these three mistakes are correlated with some objective
measure is in fact not necessarily relevant to the remedial actions the teacher will take,
and may even be better judged by the teacher. Therefore, we think that further subjec-
tive measures or criteria such as actionability should also be taken into account to filter
associations.
References
1. Wang, F., On using Data Mining for browsing log analysis in learning environments. In Data
Mining in E-Learning. Series: Advances in Management Information, C. Romero and S. Ventura,
editors. WIT Press, Southampton, U.K., pp. 57–75, 2006.
2. Wang, F.-H. and H.-M. Shao, Effective personalized recommendation based on time-framed nav-
igation clustering and association mining. Expert Systems with Applications 27(3): 365–377, 2004.
3. Minaei-Bidgoli, B., D.A. Kashy, G. Kortemeyer, and W.F. Punch, Predicting student per-
formance: an application of data mining methods with the educational web-based system
LON-CAPA. In ASEE/IEEE Frontiers in Education Conference, IEEE, Boulder, CO, 2003.
4. Lu, J., Personalized e-learning material recommender system. In International Conference on
Information Technology for Application (ICITA’04), Harbin, China, pp. 374–379, 2004.
5. Merceron, A. and K. Yacef, Mining student data captured from a Web-based tutoring tool:
Initial exploration and results. Journal of Interactive Learning Research (JILR) 15(4): 319–346, 2004.
6. Romero, C., S. Ventura, C. de Castro, W. Hall, and M.H. Ng, Using genetic algorithms for
data mining in Web-based educational hypermedia systems. In Adaptive Systems for Web-based
Education, Malaga, Spain, May 2002.
7. Agrawal, R. and R. Srikant, Fast algorithms for mining association rules. In VLDB, Santiago,
Chile, 1994.
8. Tan, P.N., V. Kumar, and J. Srivastava, Selecting the right interestingness measure for asso-
ciation patterns. In 8th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, San Francisco, CA, pp. 67–76, 2001.
9. Brijs, T., K. Vanhoof, and G. Wets, Defining interestingness for association rules. International
Journal of Information Theories and Applications 10(4): 370–376, 2003.
10. Yacef, K., The Logic-ITA in the classroom: a medium scale experiment. International Journal on
Artificial Intelligence in Education 15: 41–60, 2005.
11. Merceron, A. and K. Yacef, Revisiting interestingness of strong symmetric association rules in
educational data. In International Workshop on Applying Data Mining in e-Learning (ADML’07),
Crete, Greece, 2007.
12. Tan, P.N., M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson Education, Boston,
MA, 2006.
13. Merceron, A. and K. Yacef, Interestingness measures for association rules in educational data.
In International Conference on Educational Data Mining, R. Baker and J. Beck, editors, Montreal,
Canada, pp. 57–66, 2008.
14. Tan, P.N., V. Kumar, and J. Srivastava, Selecting the right interestingness measure for asso-
ciation patterns. In 8th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, San Francisco, CA, August 26–29, 2001.
15. Minaei-Bidgoli, B., P.-N. Tan, and W.F. Punch, Mining interesting contrast rules for a Web-based
educational system. In International Conference on Machine Learning Applications (ICMLA 2004),
Louisville, KY, December 16–18, 2004.
16. Merceron, A. and K. Yacef, A Web-based tutoring tool with mining facilities to improve learn-
ing and teaching. In 11th International Conference on Artificial Intelligence in Education, F. Verdejo
and U. Hoppe, editors, IOS Press, Sydney, Australia, pp. 201–208, 2003.
17. Merceron, A. and K. Yacef, Educational data mining: A case study. In Artificial Intelligence in
Education (AIED2005), C.-K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, editors. IOS Press,
Amsterdam, the Netherlands, pp. 467–474, 2005.
18. Merceron, A., Strong symmetric association rules and interestingness measures. In Advances in
Data Warehousing and Mining (ADWM) Book Series, Rare Association Rule Mining and Knowledge
Discovery: Technologies for Infrequent and Critical Event Detection, Y.S. Koh and N. Rountree, edi-
tors. IGI Global, Hershey, PA, 2009.
19. Han, J. and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufman, San Francisco,
CA, 2001.
18
Data Mining for Contextual Educational
Recommendation and Evaluation Strategies
Contents
18.1 Introduction  257
18.2 Data Mining in Educational Recommendation  258
18.2.1 Non-Multidimensional Paper Recommendation  258
18.2.2 Contextual Recommendation with Multidimensional Nearest-Neighbor Approach  259
18.3 Contextual Paper Recommendation with Multidimensional Nearest-Neighbor Approach  261
18.4 Empirical Studies and Results  264
18.4.1 Data Collection  265
18.4.2 Evaluation Results  265
18.4.3 Discussions  267
18.4.4 Implication of the Pedagogical Paper Recommender  269
18.5 Concluding Remarks  270
References  271
18.1 Introduction
When information overload intensifies, users are overwhelmed by the information pour-
ing out from various sources, including the Internet, and are usually confused about which
information should be consumed; that is, users find it difficult to pick something appro-
priate when the number of choices increases. Fortunately, a recommender system offers a
feasible solution to this problem. For example, if a user explicitly indicates that he or she
favors action movies starring Sean Penn, then he or she could be recommended movies like
The Interpreter. In this case, the system is able to match user preferences to content features
of the movies, which is a content-based filtering approach. In another major recommenda-
tion approach called collaborative filtering, the system constructs a group of like-minded
users with whom the target user shares similar interests and makes recommendations
based on an analysis of them.
For learners engaging in senior-level courses, tutors, in many cases, would like to pick
some articles as supplementary reading materials for them each week. Unlike research-
ers “Googling” research papers matching their interests from the Internet, tutors, when
making recommendations, should consider the course syllabus and their assessment of
learners along many dimensions. As such, simply Googling articles from the Internet is
257
far from enough. Suppose a paper recommender system can carefully assess and compare
both learner and candidate paper characteristics (through instructions from tutors), and
make recommendations accordingly. In other words, learner models of each individual,
including their learning interest, knowledge, goals, etc., will be created. Paper models will
also be created based on the topic, degree of peer recommendation, etc. The recommenda-
tion is carried out by matching the learner characteristics with the paper topics to achieve
appropriate pedagogical goals such as “the technical level of the paper should not impede
the learner in understanding it.” Therefore, the suitability of a paper for a learner is calcu-
lated in terms of the appropriateness of it to help the learner in general. This type of recom-
mendation system is called a pedagogical paper recommender. In this chapter, we mainly
discuss the potentials of data mining techniques in making personalized recommenda-
tions in the e-learning domain and highlight the importance of conducting appropriate
evaluations.
The organization of this chapter is as follows. In the next section, a discussion on educa-
tional recommendation techniques is presented. Section 18.3 focuses on system design and
architecture; key recommendation techniques are also presented. Section 18.4 includes the
empirical studies conducted and discusses the results. Implications of our study are also
pointed out. Section 18.5 concludes this chapter.
* CiteSeer is a publicly available paper searching tool, and can be accessed at: http://citeseer.ist.psu.edu/cs
only matching the interests of the target users, without probing into the details of why
the users like it, or how the users will like it. For example, a user does not like animated
movies, but likes to watch them during nonworking days with his or her kids. Therefore,
he or she should be recommended The Incredibles on Saturdays and Sundays. As another
example, Joe normally does not like romantic comedy movies, especially those starring
Meg Ryan; but he will be willing and happy to watch one during holidays with his wife
Sarah, who enjoys movies starring Meg Ryan (of any genre). Thus, on weekends, You’ve Got
Mail can be recommended to Joe.
In the e-learning domain, a learner does not like software testing in general, but because
he or she is taking a class on software engineering, and he or she is expecting three credits
to complete the class, he or she should be recommended an article on software testing. In
these cases, incorporating contextual information is very important and helpful in inform-
ing the recommender to provide high quality recommendations to users because they vary
in their decision making based on the “usage situation, the use of the good or service (for
family, for gift, for self) and purchase situation” (Lilien et al. [7]). For instance, customers’
seasonal buying patterns are classical examples of how customers change their buying hab-
its based on the situation. A context-aware recommender can provide a smart shopping
environment, since the recommendations are aware of the recommended shopping date/time, the location of the stores, etc. As such, a shopper can receive personal-
ized shopping recommendations in the stores of the neighborhood where he or she is.
Adomavicius et al. [8] argue that dimensions of contextual information can include when,
how, and with whom the users will consume the recommended items, which, therefore,
directly affect users’ satisfaction toward the system performance. In particular, the recom-
mendation space now consists of not only item and user dimensions, but also many other
contextual dimensions, such as location, time, and so on.* An example of a user profile and
interests could be as follows: John, age 25, enjoys watching action movies in theatres during
holidays. Hence, recommendations can be of the following form: John is recommended The
Departed during a Saturday night show at UA IFC mall. To deal with the multidimensional
CF, data warehouse and OLAP application concepts drawn from database approaches are
proposed [8]. Essentially, we have to transform the multidimensional CF into traditional
2D recommendations. Simply put, using our previous 3D-CF examples, we can first elimi-
nate the Time dimension by only considering votes delivered on weekdays from the rating
database. The resulting problem becomes the traditional 2D users vs. items CF case. In fact,
from a data warehousing perspective, this approach is similar to a slicing operation on a
multidimensional database. The rationale behind the slicing operation is straightforward:
if we only want to predict whether a user will prefer to, say, watch a movie on weekdays,
we should only consider the historical “weekday” ratings for this purpose.
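A minimal sketch of this slicing step, using pandas, is shown below; the column names and the weekday/weekend context values are invented for illustration.

```python
import pandas as pd

ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3"],
    "item":   ["i1", "i2", "i1", "i3", "i2"],
    "time":   ["weekday", "weekend", "weekday", "weekday", "weekend"],
    "rating": [4, 2, 5, 3, 4],
})

# Slice on the Time dimension: keep only weekday votes, then drop that dimension.
weekday = ratings[ratings["time"] == "weekday"].drop(columns="time")

# What remains is the traditional 2D users-vs-items matrix on which standard CF runs.
matrix = weekday.pivot_table(index="user", columns="item", values="rating")
print(matrix)
```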
Pazzani [10] also studied an earlier “version” of multidimensional CF through the aggre-
gation of users’ demographic information such as their gender, age, education, address,
etc. There are a number of ways to obtain demographic data either through explicit ways
such as questionnaires or implicit ways such as analyzing their behavioral data (purchas-
ing data). For instance, from users’ browsing behaviors, we can easily know where the
users come from; hence, the recommendations become location-aware, which is widely
used especially for recommendations to be served on mobile devices. In order to make
predictions to a target user, the demographic-based CF [10] learns a relationship between
each item and the type of people who tend to like it. Then, out of “that” type of people, the
* There are no agreed terms on what to call these additional aspects in recommender systems. Names that have
been used include contextual information [8], lifestyle [9], and demographic data [10].
CF identifies the neighbors for the target user, and makes recommendations accordingly.
Clearly, the difference between traditional CF and demographically based CF is the pre-
processing step of grouping similar users.
The most recent effort in incorporating context information in making a recommenda-
tion is a study by Lekakos and Giaglis [9], in which users’ lifestyle is considered. Lifestyle
includes users’ living and spending patterns, which are in turn affected by some external
factors such as culture, family, etc., and internal factors such as their personality, emo-
tions, attitudes, etc. The system will first compute the Pearson correlation of users’ lifestyles
to relate one user to another. In particular, the closeness between users is measured in
terms of their lifestyle instead of ratings in traditional CF: the chance that users with the
same lifestyle tend to have similar tastes will be higher. Based on it, the system will select
those users who score above a certain threshold. After this filtering process, the system
will make predictions on items for the target user based on ratings from neighbors. This
approach is essentially similar to that in [10], which is to make use of the additional infor-
mation (i.e., lifestyle, demography) to determine the closeness between users.
FIGURE 18.1
An illustration of users' relevance evaluation of a paper in the pedagogical paper recommender. (From Tang, T. Y. and McCalla, G. I., A multi-dimensional paper recommender, in Proceedings of the 13th International Conference on Artificial Intelligence in Education (AIED 2007), Marina Del Rey, CA, 2007.)
The possibility that our learners may encounter difficulty in understanding the papers has
pushed us to consider their background knowledge, and hence to adopt content-based and
hybrid filtering techniques that incorporate this factor.
With respect to the numerous purposes of recommendation, a tutor may aim at the learners' overall satisfaction (the highest possible overall ratings), at stimulating learners' interest only (the highest possible interest ratings), or at helping the learners to gain new information only (the highest possible value-added ratings), etc. Given these numerous recommendation purposes, it is imperative and appealing to collect multidimensional
ratings and to study multidimensional CF that can utilize the ratings. Table 18.1 lists out
the factors considered in the multidimensional CF.
TABLE 18.1
Factors That Are Considered in Our Multidimensional CF
3D (collaborative filtering): overall rating, value-addedness, peer recommendations
5D (user-model-based collaborative filtering): overall rating, value-addedness, peer recommendations, learner interest, learner background knowledge
Popularity-incorporated 3D (hybrid filtering): overall rating, value-addedness, peer recommendations, paper popularity r̃
Popularity-incorporated 5D (hybrid filtering): overall rating, value-addedness, peer recommendations, learner interest, learner background knowledge, paper popularity r̃
where
ri,k is the rating by user i on co-rated item k on dimension d
Pd(a, b) is the Pearson correlation based on the ratings r on dimension d:

    Pd(a, b) = Σ_K (ra,k − r̄a)(rb,k − r̄b) / sqrt[ Σ_K (ra,k − r̄a)² × Σ_K (rb,k − r̄b)² ]    (18.2)
Note that woverallâ•›+â•›wvalueaddâ•›+â•›wpeer_recâ•›=â•›1, and the values are determined by the tutor manu-
ally or the system automatically. In our experiment, these weights are tuned manually fol-
lowing a series of trials, in which those weights reported in this chapter are representative
ones. In our experiments, the number of neighbors used by this method is also set to be
from 2 to 15, where they are selected according to the weighted sum Pearson correlation
P3D(a, b) in formula (18.1). After we identify the closest neighbors for a target user a, we
then calculate the aggregate rating of each paper to make recommendations. The selection
method is by calculating the sum of weighted ratings by all closest neighbors
r_j^{3D} = \sum_{b \in B} P_{3D}(a, b) \, r_{b,j}    (18.3)

where
r_j^{3D} is the aggregate rating to paper j from all neighbors in set B
P_{3D}(a, b) is the weighted Pearson correlation between target user a and his or her neighbor b
r_{b,j} is the rating given to paper j by neighbor b

After we calculate r_j^{3D} for all papers, we can find the best paper(s) for recommendation,
that is, the paper(s) with the highest r_j^{3D}.
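To make the weighted-sum correlation and the aggregation in (18.3) concrete, the following Python sketch is a plain illustration under the assumptions stated in the text (per-dimension Pearson correlations combined with weights that sum to 1); the data structures are hypothetical and do not reproduce the authors' code.

import numpy as np

def pearson(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    ac, bc = a - a.mean(), b - b.mean()
    denom = np.sqrt((ac ** 2).sum() * (bc ** 2).sum())
    return 0.0 if denom == 0 else float((ac * bc).sum() / denom)

def p3d(ratings_a, ratings_b, weights):
    # Weighted sum of per-dimension Pearson correlations over co-rated papers.
    # ratings_x: dimension name -> vector of ratings on the co-rated papers;
    # weights: {'overall': ..., 'valueadd': ..., 'peer_rec': ...}, summing to 1.
    return sum(w * pearson(ratings_a[d], ratings_b[d]) for d, w in weights.items())

def aggregate_rating(neighbors, paper_id):
    # Formula (18.3): each neighbor's rating on the paper weighted by its P3D correlation.
    # `neighbors` is a list of (p3d_value, {paper_id: overall_rating}) pairs.
    return sum(corr * ratings.get(paper_id, 0.0) for corr, ratings in neighbors)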
By combining the Pearson correlations of 3D rating-based CF and the 2D Pearson correlation
between learners based on their student models, P2D(a, b), in a linear form, we obtain a five-
dimensional collaborative filtering method (denoted as 5D-CF). That is, we compute the
aggregated Pearson correlation in this method as follows:
TABLE 18.2
Comparison of Our Approach with Other Multidimensional CF Approaches

Adomavicius et al. (2005)
  Type of additional information: users’ demographic data; item information; consuming information
  Method of finding neighbors for a target user: users’ overall rating toward each item

Pazzani (1999)
  Type of additional information: users’ demographic data; content information
  Method of finding neighbors for a target user: learned user profile based on content information; users’ overall rating toward each item

Lekakos and Giaglis (2006)
  Type of additional information: users’ demographic data; lifestyle data
  Method of finding neighbors for a target user: learned user profile based on content information; users’ overall rating toward each item

Our approach
  Type of additional information: user models; paper features; the popularity of each paper
  Method of finding neighbors for a target user: learned user profile based on content information; users’ overall rating toward each item
where
w_r̃ is the weight of the linear combination and is the control variable in our experiment
n is the number of neighbors in CF, that is, n = |B|
Our reason for combining CF with the nonpersonalized method is to remedy the low
accuracy of CF, especially when the number of co-rated papers is low, that is, by consider-
ing the “authority” of a given paper according to people who have rated it: if the paper
has been well received, then its “authority” level is high. That is, if a given paper is
popular among all the users, then the target user might also be interested in it. Although
this popular paper cannot be used to differentiate the different tastes of users, it is worth
recommending (Table 18.2).
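The exact popularity-incorporated formula is not reproduced in this extract; purely as an illustration of the idea of blending a nonpersonalized signal with CF, a linear combination controlled by a popularity weight might look like the following sketch (the function name and the blending form are assumptions, not the authors' equation).

def pop_blended_score(cf_score, paper_popularity, w_pop):
    # Illustrative linear blend of a CF aggregate rating and a paper's popularity rating;
    # w_pop plays the role of the control weight mentioned in the text.
    return (1.0 - w_pop) * cf_score + w_pop * paper_popularity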
FIGURE 18.2
Paper feedback form.
FIGURE 18.3
The paper recommender user interface.
TABLE 18.3
Average Overall Ratings Obtained from Various Recommendation Methods,
Where |K| Is the Number of Co-Rated Papers
(Average overall ratings when the top-n papers are recommended, n = {1, 3, 5})

Best-case benchmark (popularity only): Top 1 = 3.167, Top 3 = 3.055, Top 5 = 2.992
Worst-case benchmark (random): Top 1 = 2.708, Top 3 = 2.708, Top 5 = 2.708
3D-CF (|K| = 2): Top 1 = 2.946, Top 3 = 2.884, Top 5 = 2.851
3D-CF (|K| = 4): Top 1 = 3.105, Top 3 = 2.984, Top 5 = 2.934
3D-CF (|K| = 8): Top 1 = 3.210 (p = .34), Top 3 = 3.085 (p = .31), Top 5 = 3.007 (p = .38)
5D-CF ((0.8 overall, 0.1 valueadd, 0.1 peer_rec), w2D = 1, |K| = 8): Top 1 = 3.131, Top 3 = 3.064, Top 5 = 3.011 (p = .35)
5D-CF ((1 overall, 0 valueadd, 0 peer_rec), w2D = 1, |K| = 8): Top 1 = 3.146, Top 3 = 3.068, Top 5 = 3.015 (p = .32)
Pop3D (|K| = 2): Top 1 = 3.160, Top 3 = 3.047, Top 5 = 2.995
Pop3D (|K| = 4): Top 1 = 3.160, Top 3 = 3.071 (p = .40), Top 5 = 2.995
Pop3D (|K| = 8): Top 1 = 3.160, Top 3 = 3.088 (p = .29), Top 5 = 2.995
Pop5D (|K| = 2): Top 1 = 3.081, Top 3 = 3.035, Top 5 = 2.969
Pop5D (|K| = 4): Top 1 = 3.129, Top 3 = 3.067, Top 5 = 2.997
Pop5D (|K| = 8): Top 1 = 3.158, Top 3 = 3.099 (p = .23), Top 5 = 3.019 (p = .29)
Ceiling: Top 1 = 3.560, Top 3 = 3.347, Top 5 = 3.248
not statistically significant (p-value ≥ 0.22), but most results better than the worst-case
benchmark are significantly better (p-value < 0.05, not shown here).
For 3D-CF, when the number of co-rated papers increases, the system does well in
making recommendations; the best result is obtained when the system makes the
top 1 recommendation. Compared to the performance of 3D-CF, the other three methods
are less satisfactory. The reason is that when the quantity of information injected into the
system increases, it might not help, or it might even introduce noise that reduces the effective-
ness of finding close neighbors. This phenomenon might not hold in traditional
recommendation systems, where a large amount of data is present. In our domain, the
numbers of papers and students are limited; therefore, injecting just enough information
into the system might generate quality recommendations. Nevertheless, we speculate that
when the database grows, multidimensional filtering techniques are expected to shine.
18.4.3 Discussions
Due to the limited number of students and papers, and other learning restrictions, a tutor can-
not require students to read too many papers simply in order to populate the database. As such, the
majority of typical recommender systems might not work well in the pedagogical domain.
Through experimental studies and prototypical analysis, we draw a number of impor-
tant conclusions regarding the design and evaluation of these techniques in our domain.
Although our studies advance our knowledge, more studies are needed to further our
understanding of this area.
For instance, we realized that one of the biggest challenges is the difficulty of testing
the effectiveness or appropriateness of a recommendation method due to a low number
of available ratings. Testing the method with more students, say, in two or three more
semesters, may not be helpful, because the results are still not enough to draw conclusions
as strong as those from other domains, where the ratings can number in the millions. We
are also eager to see collaborations among different institutions in using the system in a
more distributed and larger-scale fashion (as this is very difficult to achieve using one
class at a time in a single institution). To this end, our future work includes the design
of a MovieLens-like benchmark database as a test-bed on which more algorithms can be
tested (including ours).
Meanwhile, in environments where a wider range of learning scenarios exists,
evaluations can be performed in a more systematic way; for instance, by continuously
asking students to engage in “on-line reevaluation.” The reevaluation uses a strategy
similar to all-but-one: hiding the ratings of a target user in each round while
running all applicable methods on the existing data (known ratings). The idea is as follows:
suppose we have some ratings from the previous year’s learners and we want to recom-
mend papers to new learners. Suppose we have several applicable methods but we do not
know which one is the best. This situation may arise when we have collected enough data
(ratings) in the middle or near the end of the class, or when we want to use this prototype
for classes other than software engineering. Then we may pick a learner from the previ-
ous year (i.e., an old learner), whose ratings are known, to test these methods. If any of the
methods is superior to the rest in making recommendations, in terms of the ratings given
by the old learner, then the method may also be the best for making a recommendation
to the new learner.
We can also perform more task-oriented evaluations, such as evaluating the appropriate-
ness of recommendations through a group-oriented task, where learners will be asked to
design and implement an automatic tool for, say, a family calendar. In this case, learners
would be required to write a research paper documenting their experiences as well as
lessons learned from the task. Both the after-project questionnaire and the analysis of the
report could help to evaluate the actual “uses” of the recommended papers for learning.
Another task can be asking learners to conduct a research study on a topic of their choice
in 3 weeks, say, the adoption of CASE tools for software testing, and present their work
in front of the class. In another example, where multiple institutions are involved, we may
divide students into groups from different institutions (or even different countries), and
each group is then required to undertake a distributed software project, where a number
of CASE tools can be used to coordinate their actions. In this case, reference articles can
be supplied weekly as in our settings, though more papers can be included. Then, by the
end of their project, learners will be asked to not only turn in a workable system, but also
document what they have learned in a research paper.
Then, in each of these tasks, learners will be asked to evaluate the pedagogical useful-
ness of the recommended papers in terms of, among other things: how the papers can help their under-
standing of the lecture topics, how the knowledge they learned can be used to guide their
joint software development effort, and how the papers can be used as “seed” ones to help
them “Google” more.
We believe that the task-oriented evaluation framework is more appropriate than purely
presenting traditional metrics such as precision, recall, etc., for directly assessing user
satisfaction and acceptance of the recommended items. Nonetheless, we think that the
factors we have considered so far (e.g., interestingness and value-addedness) represent the
most typical factors that need to be taken into consideration when making recommenda-
tions in the pedagogical domain. The issues and conclusions we suggest can inform
future studies and, finally, further our understanding of making recommendations
in this domain.
FIGURE 18.4
Paper annotations with temporal sequences of user models. (The figure sketches an intelligent data processor connecting tutors and learners; it draws on clusters of learner behaviors and reading patterns (DB1), clusters of papers with similar reading patterns and ratings (DB2), clusters of paper usages for novice learners, experts, etc. (DB3), and clusters of learners with similar goals, ratings, etc. (DB4), serving both design-purposed and pedagogical-purposed information.)
and recommendations made accordingly [15]. For instance, patterns could be found such
as learning objects that are highly rated by learners with a deep understanding of some
knowledge, by learners with unusual tastes, etc. The sequence of a learner’s interactions
with various learning objects forms a “foot print” of that learner’s activity in the system
[11,16].
The research-paper recommender system has been one of the core inspirations for the
ecological approach, a learner-centered, adaptive, reactive, and collaborative learning
framework proposed in [15]. Specifically, in the ecological approach, it is assumed that
there are a large number of learning objects (research papers, web learning resources,
online quiz banks, etc.) and a number of different applications that support learners
including learning object recommenders (similar to the paper recommender), collaborative
activities such as reading and editing, expert finders, and so on. When a learner
interacts with a learning object, the object is “annotated” with an instance of the learner
model. After a learner has socialized with other learners, the corresponding information can
also be tagged. These tagged pieces of information embedded in a learner-model instance
can include the following:
• Features about the learner, including cognitive and social characteristics and,
most importantly, their goal(s) in making the encounters.
• Learner feedback on the information content of the learning object, including its
information and cognitive quality. For instance, information quality covers aspects such as
content accuracy and up-to-dateness, while cognitive quality covers aspects such as the appro-
priateness of the object for learners “like” him or her and the efficacy of the object
with respect to their goal(s) in accessing the object, etc.
Over time, each learning object slowly accumulates learner-model instances that collec-
tively form a record of the experiences of all sorts of learners as they have interacted with
the learning object as well as their various interactions.
In sum, then, the ecological approach highlights the vision that information gradually
accumulates about learning objects, the interactions, and the users. These pieces of infor-
mation capture “on-the-fly” information about various activities in the system and there-
fore should be interpreted in the context of end use during the active learning modeling
[17]. Gradually, through mechanisms like natural selection, the system will determine
what information is useful and what is not, for which purpose.
approaches, might better help in predicting which papers would be useful and suitable
to the current topic. For instance, in a syllabus, it is stated that the students have taken an
introductory software engineering course and a programming language course before; in
other words, the knowledge background of students is known.
It is our hope that the studies we initiated here can open up opportunities for research-
ers to probe into the use of automated social tools to support active learning and
teaching in future networked learning environments [18], where the flow of knowledge will
be governed by the speed of human to human interaction (p. 179) as well as human to system
interaction.
References
1. Basu, C., Hirsh, H., Cohen, W., and Nevill-Manning, C. (2001). Technical paper recommen-
dations: A study in combining multiple information sources. Journal of Artificial Intelligence
Research (JAIR), 1: 231–252.
2. Bollacker, K., Lawrence, S., and Giles, C. L. (1999). A system for automatic personalized track-
ing of scientific literature on the web. In Proceedings of IEEE/ACM Joint Conference on Digital
Libraries (ACM/IEEE JCDL’1999), Berkeley, CA, pp. 105–113.
3. Woodruff, A., Gossweiler, R., Pitkow, J., Chi, E., and Card, S. (2000). Enhancing a digital book
with a reading recommender. In Proceedings of ACM Conference on Human Factors in Computing
Systems (ACM CHI’00), the Hague, the Netherlands, pp. 153–160.
4. McNee, S., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S., Rashid, A., Konstan, J. and Riedl, J.
(2002). On the recommending of citations for research papers. In Proceedings of ACM Conference
on Computer Supported Collaborative Work (CSCW’02), New Orleans, LA, pp. 116–125.
5. Torres, R., McNee, S. M., Abel, M., Konstan, J. A., and Riedl, J. (2004). Enhancing digital librar-
ies with TechLens. In Proceedings of IEEE/ACM Joint Conference on Digital Libraries (ACM/IEEE
JCDL’2004), Tucson, AZ, pp. 228–236.
6. Middleton, S. E., Shadbolt, N. R., and De Roure, D. C. (2004). Ontological user profiling in rec-
ommender systems. ACM Transactions on Information Systems, 22(1): 54–88, January 2004.
7. Lilien, G. L., Kotler, P., and Moorthy, S. K. (1992). Marketing Models. Prentice Hall, Englewood
Cliffs, NJ, pp. 22–23.
8. Adomavicius, G., Sankaranarayanan, R., Sen, S., and Tuzhilin, A. (2005). Incorporating contextual
information in recommender systems using a multidimensional approach. ACM Transactions on
Information Systems, 23(1): 103–145. January 2005.
9. Lekakos, G. and Giaglis, G. (2006). Improving the prediction accuracy of recommendation
algorithms: Approaches anchored on human factors. Interacting with Computers, 18(3): 410–431,
2006.
10. Pazzani, M. (1999). A framework for collaborative, content-based, and demographic filtering.
Artificial Intelligence Review, 13: 393–408, December 1999.
11. Tang, T. Y. and McCalla, G. I. (2007). A multidimensional paper recommender. In Proceedings
of the 13th International Conference on Artificial Intelligence in Education (AIED 2007), Marina Del
Rey, CA.
12. Tang, T. Y. (2008). The design and study of pedagogical paper recommendation. PhD thesis.
Department of Computer Science, University of Saskatchewan, May 2008.
13. Tang, T. Y. and McCalla, G. I. (2009). A multidimensional paper recommender: Experiments
and evaluations. IEEE Internet Computing, 13(4): 34–41, IEEE Press.
14. Tang, T. Y. and McCalla, G. I. (2005). Paper annotations with learner models. In Proceedings of the
12th International Conference on Artificial Intelligence in Education (AIED 2005), Amsterdam, the
Netherlands, pp. 654–661.
15. McCalla, G. I. (2004). The ecological approach to the design of e-learning environments:
Purpose-based capture and use of information about learners. Journal of Interactive Media in
Education (JIME), 7, Special issue on the educational semantic web.
16. Tang, T. Y. and McCalla, G. I. (2006). Active, context-dependent, data centered techniques for
e-learning: A case study of a research paper recommender system. In C. Romero and S. Ventura
(Eds.), Data Mining in E-Learning, WIT Press, Southampton, U.K., pp. 97–116.
17. Vassileva, J., McCalla, G. I., and Greer, J. (2003). Multi-agent multiuser modeling in I-Help.
Journal of User Modeling and User-Adapted Interaction, 13: 179–210, Springer.
18. McCalla, G. I. (2000). The fragmentation of culture, learning, teaching and technology:
Implications for the artificial intelligence in education research agenda in 2010. Special
Millennium Issue on AIED in 2010, International Journal of Artificial Intelligence in Education, 11:
177–196.
19
Link Recommendation in E-Learning Systems
Based on Content-Based Student Profiles
Contents
19.1 Introduction......................................................................................................................... 273
19.2 Related Works...................................................................................................................... 274
19.3 Recommendation Approach............................................................................................. 276
19.3.1 Capturing Learning Experiences......................................................................... 276
19.3.2 Learning Content-Based Profiles.......................................................................... 277
19.3.3 Detecting Active Interests..................................................................................... 280
19.3.4 Context-Aware Recommendation......................................................................... 281
19.4 Case Study........................................................................................................................... 283
19.5 Conclusions..........................................................................................................................284
References...................................................................................................................................... 285
19.1 Introduction
E-learning systems offer students an opportunity to engage in an interactive learning pro-
cess in which they can interact with each other, teachers, and the learning material. Most
of these systems allow students to navigate the available material organized based on the
criteria of teachers and, possibly, the student learning style and background. However, the
Web is an immense source of information that can be used to enrich the learning process
of students and, thus, expand the imparted knowledge and acquired skills provided by the
E-learning systems. In order to exploit the information available on the Web in a fruitful
way, material should be carefully selected and presented to students at the proper time so as not
to overload them with irrelevant information.
Many approaches have emerged in the past few years taking advantage of data mining
(DM) and recommendation technologies to suggest relevant material to students accord-
ing to their needs, preferences, or behaviors. These techniques analyze logs of student
behavior, their individual characteristics, learning styles, and other data to
discover patterns of behavior regarding content in the system that are later applied to per-
sonalize the student’s interaction with the system. Learning of association rules, induction
of classifiers, and data clustering have been used in several works to enhance the effective-
ness in the presentation and navigation of content in E-learning systems, consequently
enriching the learning experience of students.
with the target user and make predictions based on the aggregation of the ratings given
by these users. Conversely, model-based algorithms learn a model for the prediction of
ratings. Typically, the model building process is performed by different ML and DM algo-
rithms such as Bayesian networks, clustering, and rule-based approaches.
In several works, clustering algorithms have been employed to group students with
similar navigation behavior or learning sessions and learn group profiles characterizing
the properties of each cluster. An example is presented in [2], in which unsupervised learning
is used to automatically identify common learning behaviors and then train a classi-
fier user model that can inform an adaptive component of an intelligent learning envi-
ronment. Talavera and Gaudioso [19] present experiments using clustering techniques
to automatically discover useful groups from students to obtain profiles of student
behaviors.
Other examples of model-based recommendation approaches employed in E-learning
systems use association rules, sequential learning, or Bayesian networks. Shen and Shen
[18] organize contents into small atomic units called learning objects and use rules to
guide the learning resource recommendation service based on simple sequencing specifi-
cation. In [9], the extraction of sequential patterns has been used to find patterns of char-
acteristic behaviors of students that are used for recommending relevant concepts to these
students in an adaptive hypermedia educational system. The discovery of association
rules in the activity logs of students has been used in an agent assisting the online learn-
ing process [23] to recommend online learning activities or shortcuts on a course Web
site. In the opposite direction, mining of association rules has also been used to discover
patterns of usage in courses [3]. This approach aims to help professors to improve the
structure and contents of an E-learning course. Likewise, rules denoting dependence rela-
tionships among the usage data during student sessions have been discovered to improve courses [15]. In [12],
sequential pattern mining algorithms are applied to discover the paths most used by stu-
dents and recommend links to new students as they browse the course. Hämäläinen
et al. [7] construct a Bayesian network to describe the learning process of students in order
to classify them according to their skills and characteristics and offer guidance according
to their particularities.
Hybrid schemes combining recommendation methods belonging to different recom-
mendation approaches are frequently used to better exploit their advantages. For exam-
ple, clusters of students with similar learning characteristics are created in the research
paper recommender system presented in [20] as a previous step to CF with the goal of
scaling down the candidate users to those clustered together and obtained more person-
alized recommendation. CF is based on the ratings or feedback given by students toward
the recommended papers. In [4], association rule mining is used to discover interesting
information in usage data registered by students in the form of IF–THEN recommenda-
tion rules. Then, a collaborative recommender system is applied to share and score the
recommendation rules obtained by teachers with similar profiles along with other experts
in education. A personalized recommendation method integrating user clustering and
association mining techniques is presented in [22]. This work proposes a novel clustering
algorithm to group users based on their historical navigation sessions partitioned accord-
ing to specific time intervals. Association rules are then mined from navigation sessions
of the groups to establish a recommendation model for similar students in the future.
Romero et al. [16] cluster students showing common behavior and knowledge, and then
try to discover sequential patterns in each cluster for generating personalized recommen-
dation links.
FIGURE 19.1
Overview of the recommendation approach (the student browses the learning material; active interests and context are detected from the active session, and recommendations are retrieved from the Web).
read in an online Web-based course are captured and considered different learning
experiences of this student. In other words, a student profile is composed of a set of learn-
ing experiences E = ⟨e1, e2, …, el⟩ hierarchically organized in a set of categories. From a
content analysis of these experiences comes out the knowledge to be modeled in student
profiles.
Each experience encapsulates both the specific and contextual knowledge that describes
a particular situation denoting a student reading of a certain piece of information.
Experiences can be divided into three main parts: the description of the Web page content,
the description of the associated contextual information, and the outcome of applying the
experience for suggesting information. The first part enables content-based recommenda-
tion by discovering and retrieving topical related information, whereas the second and
third parts allow the recommendation method to take into consideration the current stu-
dent contexts and determine a level of confidence in recommendations generated using
each individual experience.
The content of Web pages read by a student as part of a course is represented according to
the vector space model (VSM) [17]. In this model, each document is identified by a feature
vector in a space in which each dimension corresponds to a distinct term associated with
a numerical value or weight which indicates its importance. The resulting representation
of a Web page and the corresponding learning experience is, therefore, equivalent to a
t-dimensional vector:
e_i = \langle (t_1, w_1), \ldots, (t_t, w_t) \rangle

where w_i represents the weight of the term t_i in the page or document d_i. In a previous step,
noninformative words such as prepositions, conjunctions, pronouns, and very common
verbs, commonly referred to as stop-words, are removed using a standard stop-word list
and a stemming algorithm is applied to the remaining words in order to reduce the mor-
phological variants of words to their common roots [13].
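A minimal sketch of this preprocessing and representation step is given below; it assumes the NLTK Porter stemmer is available, uses a toy stop-word list, and weighs terms by normalized term frequency, whereas the actual system may use a different weighting scheme.

from collections import Counter
from nltk.stem import PorterStemmer  # assumes NLTK is installed

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}  # toy list

def page_to_vector(text):
    # Turn a Web page's text into a weighted term vector (normalized term frequency here)
    stemmer = PorterStemmer()
    terms = [stemmer.stem(w) for w in text.lower().split()
             if w.isalpha() and w not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values()) or 1
    return {t: c / total for t, c in counts.items()}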
Learning experiences also describe the contextual information of the situation in which
the Web page was captured, including the page address, date and time it was registered,
and the level of interest the user showed in the page according to some preestablished
criteria.
In the description of a learning experience, the patterns of received feedback regard-
ing suggestions given based on the knowledge provided by them are also kept track of.
Basically, each experience ei has an associated relevance reli, which is a function of the ini-
tial interest of the experience, the number of successful and failed recommendations made
based on this experience, and the time that passes from the moment in which the experi-
ence was captured or used for the last time. Figure 19.2 shows an example of a learning
experience. Then, the user profile consists of pairs 〈ei, reli 〉, where ei is the user experience
encoding mainly the Web page the user found interesting, and reli represents the evidence
about the user interest in that experience, confined to the [0,1] interval, according to the
collected evidence.
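The chapter does not give the exact functional form of rel_i; the following sketch is one plausible, purely illustrative way to combine the initial interest, the counts of successful and failed recommendations, and a time decay, keeping the result in [0, 1] (the step size and half-life are assumptions).

def experience_relevance(initial_interest, successes, failures, days_since_used,
                         step=0.1, half_life_days=30.0):
    # Each successful recommendation raises relevance by `step`, each failure lowers it,
    # and relevance decays exponentially with the time since the experience was last used.
    base = initial_interest + step * successes - step * failures
    decay = 0.5 ** (days_since_used / half_life_days)  # exponential forgetting
    return max(0.0, min(1.0, base * decay))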
FIGURE 19.2
Example of a learning experience.
conceptual clustering) [6], with structures and procedures specifically designed for user
profiling on the Web. Modeling learning material using conceptual clustering allows
acquiring descriptions of the topics learned by students, without their intervention, through
observation of student activities.
WebDCC carries out incremental, unsupervised concept learning over the collected
experiences. Conceptual clustering includes not only clustering, but also characteriza-
tion, i.e., the formation of intensional concept descriptions for extensionally defined clus-
ters. It is defined as the task of, given a sequential presentation of experiences and their
associated descriptions, finding clusters that group these experiences into concepts or
categories, a summary description of each concept and a hierarchical organization of
them [21].
The advantage of using this algorithm is twofold. Incremental learning of student pro-
files allows acquiring and maintaining knowledge over time as well as dealing with subject
areas that cannot be preestablished beforehand. Also, the result of characterization is a
readable description of the learning material the student read as a means of understanding
further information needs.
The WebDCC algorithm takes the learning experiences captured through observation in an
online fashion. Experiences are analyzed to learn a conceptual description of their con-
tent and organized within the student profiles. Identification of categories or topics in the
material the student read is based on clustering of similar past experiences.
In document clustering, clusters are distinct groups of similar documents and it is gen-
erally assumed that clusters represent coherent topics. Leaves in the hierarchy correspond
to clusters of experiences belonging to all ancestor concepts. That is, clusters group highly
similar experiences observed by the algorithm so that a set of n_i experiences or docu-
ments belonging to a concept c_i, denoted by E_i = {e_1, e_2, …, e_{n_i}}, is organized into a collec-
tion of k clusters, U_i = {u_{1i}, u_{2i}, …, u_{ki}}, containing elements of E_i such that u_{li} ∩ u_{pi} = ∅, ∀ l ≠ p.
As relevant experiences appear, they are assigned to clusters in the student profile. Each
experience can be incorporated to either some of the existent clusters or to a novel cluster
depending on its similarity with the current categories represented in the profile.
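The incremental assignment step can be pictured with the following sketch, which is not the actual WebDCC procedure: a new experience vector is placed in the most similar existing cluster if its cosine similarity to that cluster's centroid reaches a threshold, and otherwise starts a new cluster (the 0.3 threshold is illustrative, and centroid updating is omitted).

import math

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def assign_incrementally(experience, clusters, new_cluster_threshold=0.3):
    # `clusters` is a list of dicts with a 'centroid' term-weight vector and a 'members' list
    if clusters:
        best = max(clusters, key=lambda c: cosine(experience, c["centroid"]))
        if cosine(experience, best["centroid"]) >= new_cluster_threshold:
            best["members"].append(experience)
            return best
    new_cluster = {"centroid": dict(experience), "members": [experience]}
    clusters.append(new_cluster)
    return new_cluster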
Hierarchies of concepts are classification trees in which internal nodes represent
concepts and leaf nodes represent clusters of experiences. The root of the hierarchy corre-
sponds to the most general concept, which comprises all the experiences the algorithm has
seen, whereas inner concepts become increasingly specific as they are placed lower in the hierarchy.
FIGURE 19.3
Context-aware recommendation (recommendations are generated from the active browsing session and the Web).
After incorporating a new learning experience, the current hierarchy is evaluated from a
structural point of view. In this evaluation, new concepts can be created or concepts can be
merged, split, or promoted. Concept formation is driven by the notion of conceptual cohe-
siveness. Highly cohesive clusters are assumed to contain similar experiences belonging
to a same category, whereas clusters exhibiting low cohesiveness are assumed to contain
experiences concerning distinctive aspects of more general categories. In the last case, a
concept summarizing the cluster is extracted, enabling a new partitioning of experiences
and the identification of subcategories. A merge operation takes two concepts and com-
bines them into a single one, whereas splitting takes place when a concept is no longer
useful to describe experiences in a category and then it can be removed. The promotion of
concepts to become siblings of their parent concepts is also taken into account in order to
place concepts at the appropriate hierarchical level.
document belongs to the AI category, but these terms are unlikely to be useful for either
a classifier at the same hierarchical level (e.g., Logic) or a classifier at the next hierarchical
level (e.g., Bayesian networks).
Linear classifiers, which embody an explicit or declarative representation of the
category based on which categorization decisions are taken, are applied in the WebDCC
algorithm. Learning in linear classification consists of examining the training pages (i.e.,
pages read by the student) a finite number of times to construct a prototype for the cat-
egory, which is later compared against the pages to be classified. A prototype p_{c_i} for
a category c_i consists of a vector of weighted terms, p_{c_i} = \langle (t_1, w_1), \ldots, (t_p, w_p) \rangle, where w_j
is the weight associated with the term t_j in the category. This kind of classifier is both
efficient, since classification is linear in the number of terms, documents, and categories,
and easy to interpret, since it is assumed that terms with higher weights are considered
better predictors for the category than those with lower weights as can be observed in
the figure. WebDCC builds a hierarchical set of classifiers, each based on its own set of
relevant features, as a combined result of a feature selection algorithm for deciding on
the appropriate set of terms at each node in the tree and a supervised learning algorithm
for constructing a classifier for each such node. The Rocchio classifier is used in this algorithm to
train classifiers with β = 1 and γ = 0, since no negative examples are available and there is
no initial query, yielding

p_{c_i} = \frac{1}{|c_i|} \sum_{d \in c_i} d

as a prototype for each class c_i ∈ C. Hence, each prototype p_{c_i} is the plain average of all
training pages belonging to the class c_i, and the weight of each term is simply the aver-
age of its weights in positive pages of the category. To categorize a new page into a given
category, its closeness to the prototype vector of the category is computed using the stan-
dard cosine similarity measure [17]. Each category prototype vector has an attached
classification threshold τ, which indicates the minimum similarity to the prototype
that pages should have in order to fall into that category. The induction
of classifiers for hierarchical categories is carried out during the learning of the student
profile.
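The prototype construction and the thresholded cosine test can be summarized by the sketch below; the value of τ is illustrative, and the vectors are simple term-weight dictionaries as in the earlier sketches.

import math

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def build_prototype(positive_page_vectors):
    # Rocchio prototype with beta = 1, gamma = 0: the plain average of the positive page vectors
    proto = {}
    for vec in positive_page_vectors:
        for term, w in vec.items():
            proto[term] = proto.get(term, 0.0) + w
    n = len(positive_page_vectors) or 1
    return {t: w / n for t, w in proto.items()}

def falls_into_category(page_vector, prototype, tau=0.4):
    # A page falls into the category if its cosine similarity to the prototype reaches tau
    return cosine(page_vector, prototype) >= tau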
are detected as current interests and Web pages are searched on the Web about them. Thus,
recommendations are delivered to the student as the result of the Web search carried out.
In order to capture the current user context so that recommendations can be delivered
in the precise moment they are needed, a fixed-size sliding window is used over the active
session. For a sliding window of size n, the active session ensures that only the last n vis-
ited pages influence recommendation. In Figure 19.3, for example, a small sliding window
will allow the system to eliminate Bayesian networks as an active interest if the student continues read-
ing some more pages about Logic in a course.
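A fixed-size sliding window of this kind can be sketched as follows (the window size and the notion of an "active interest" as the categories currently present in the window are illustrative choices).

from collections import Counter, deque

class ActiveSession:
    # Keep only the last n visited pages; categories seen in the window are the active interests
    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)  # older pages fall out automatically

    def visit(self, page_id, category):
        self.window.append((page_id, category))

    def active_interests(self):
        counts = Counter(category for _, category in self.window)
        return [category for category, _ in counts.most_common()]

For example, after the student reads several more pages about Logic, earlier Bayesian networks pages drop out of the window, so Bayesian networks stops being reported as an active interest.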
In contrast to student profile learning, recommendation is an online process in which
the E-learning system needs to determine a set of candidate recommendations beforehand
or trigger a Web search. To gather potential learning material to recommend in advance,
an agent might perform a Web search to retrieve pages belonging to the concepts the stu-
dent is interested in during idle computer time. For example, the system can retrieve pages
from some fixed sites periodically (e.g., university Web pages) or find nearest neighbor
documents by using some experiences in the profile as query (e.g., asking “more docu-
ments like this” to specialized search engines such as Google Scholar).
In order to evaluate whether a candidate Web page resulting from the Web search should
be recommended to a given student, the system searches for similar learning experiences
in the profile of this student assuming that the interest in a new page will resemble the
interest in similar material read in the past. The existing categories in the profile bias the
search toward the most relevant experiences.
The comparison between previous learning experiences and Web pages to be recom-
mended is performed across a number of dimensions that describe them, the most
important being the one that measures the similarity between the item contents, which
is estimated using the cosine similarity. The n best experiences, E = {e_1, e_2, …, e_n}, which
exceed a minimum similarity threshold, are selected to determine the convenience of
making a recommendation. To assess the confidence in recommending a candidate Web
page ri given the experiences in E, a weighted sum of the confidence value of each similar
retrieved experience is then calculated as follows:
conf(r_i) = \frac{\sum_{k=1}^{n} w_k \cdot rel_k}{\sum_{k=1}^{n} w_k}
where n is the number of similar experiences retrieved, rel k is the relevance in the profile
of the experience ek and wk is the contribution of each experience according to its similarity.
This method to estimate the confidence in a recommendation is based on the well-known
distance-weighted nearest neighbor algorithm.
Each experience has a weight w_k according to the inverse square of its distance from r_i.
Thus, the more similar and relevant pages are, the more important they are for assessing the confi-
dence in a recommendation. If the confidence value of recommending r_i is greater than a
certain confidence threshold, which can be customized in the E-learning system, the page
is recommended.
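A sketch of this distance-weighted confidence computation follows; it assumes the distance of an experience is taken as one minus its cosine similarity, which is an assumption of the example rather than a detail given in the chapter.

def recommendation_confidence(similar_experiences, eps=1e-6):
    # `similar_experiences` is a list of (similarity, relevance) pairs;
    # weights are the inverse square of the distance (1 - similarity)
    num, den = 0.0, 0.0
    for sim, rel in similar_experiences:
        w = 1.0 / ((1.0 - sim) ** 2 + eps)
        num += w * rel
        den += w
    return 0.0 if den == 0 else num / den

def should_recommend(similar_experiences, confidence_threshold=0.5):
    # The threshold can be customized in the E-learning system; 0.5 matches the case study
    return recommendation_confidence(similar_experiences) >= confidence_threshold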
From the moment that users provide either explicit or implicit feedback about recom-
mendations, agents start learning from their actions. If the result of a recommendation is
successful, then the system learns from the success by increasing the relevance of the cor-
responding experiences in the profile and, possibly, incorporating new experiences. If the
result of a recommendation is a failure, the system learns from the mistake by decreasing
the relevance of the experiences that have led to the unsuccessful recommendation or even
removing them.
FIGURE 19.4
Screenshot showing suggestions to students.
* www.e-unicen.edu.ar
FIGURE 19.5
Experimental results in a case study: the percentage of suggestions and the precision (%) for each unit and subject of the course (U1-S1 through U4-S2).
Figure 19.5 shows the variation of precision in the generated recommendations as the
student advances in the course and the profile is learned. Precision measures the propor-
tion of relevant material recommended to the student out of the total of recommendations
made. A relevance judgment was established by manually visiting each suggested Web
page and assessing its importance regarding the reading material. For this experiment, we
limited the number of candidate Web pages retrieved from the Web to 100, i.e., we retrieved
the top 100 pages returned by a search engine, and those recommendations exceeding a
confidence threshold of 50% were finally suggested. Both the percentage of recommended
pages and the precision achieved with these suggestions are depicted in the figure.
Although more experimentation is needed, it is possible to observe that only a small number of
pages are recommended out of the top 100 Web pages (more pages can be found by extend-
ing the set of candidate pages to include pages placed lower in the search engine list of
results); this number is even smaller for more specific subjects. However, these recommen-
dations are made with a high level of precision. In addition, precision can be improved
even more by increasing the relevance threshold, which can be manually done in the user
interface.
19.5 Conclusions
In summary, this chapter presents a recommendation approach to suggest relevant learn-
ing material (e.g., Web pages, papers, etc.) to students interacting with an E-learning sys-
tem. This approach is based on building content-based profiles starting from observation
of student readings and behavior in the courses offered by the system. Whereas profiles provide
a characterization of the material accessed by a student and read during the learning pro-
cess, recommendations are based on the portion of such interests that are active at a given
moment. Thus, relevant material is presented to students according to their current context
of activities, given by the information they are actually reading in the system, so that they
can enrich their knowledge about a subject by exploiting relevant material existing on the
Web and not only the material offered through the E-learning system.
References
1. G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A sur-
vey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data
Engineering, 17(6):734–749, 2005.
2. S. Amershi and C. Conati. Unsupervised and supervised machine learning in user modeling for
intelligent learning environments. In Proceedings of the 2007 International Conference on Intelligent
User Interfaces. ACM Press, New York, 2007, pp. 72–81.
3. E. García, C. Romero, S. Ventura, and C. de Castro. Using rules discovery for the continu-
ous improvement of e-learning courses. In Intelligent Data Engineering and Automated Learning
IDEAL 2006, Volume 4224. LNCS, Burgos, Spain, 2006, pp. 887–895.
4. E. García, C. Romero, S. Ventura, and C. de Castro. An architecture for making recommen-
dations to courseware authors using association rule mining and collaborative filtering. User
Modeling and User-Adapted Interaction, 19(1–2):99–132, 2009.
5. P. García, S. Schiaffino, and A. Amandi. An enhanced bayesian model to detect students’ learn-
ing styles in Web-based courses. Journal of Computer Assisted Learning, 24(4):305–315, 2008.
6. D. Godoy and A. Amandi. Modeling user interests by conceptual clustering. Information
Systems, 31(4–5):247–265, 2006.
7. W. Hämäläinen, T. H. Laine, and E. Sutinen. Data mining in personalizing distance education
courses. In Data Mining in E-Learning. WIT Press, Southampton, U.K., 2005, pp. 157–172.
8. M. K. Khribi, M. Jemni, and O. Nasraoui. Automatic recommendations for e-learning person-
alization based on web usage mining techniques and information retrieval. In Proceedings of the
8th IEEE International Conference on Advanced Learning Technologies (ICALT’08). Santander, Spain,
July 1–5, 2008, pp. 241–245.
9. A. Kristofic. Recommender system for adaptive hypermedia applications. In Proceedings of
Informatics and Information Technology Student Research Conference. Brisbane, Australia, 2005,
pp. 229–234.
10. D. Lemire, H. Boley, S. Mcgrath, and M. Ball. Collaborative filtering and inference rules for
context-aware learning object recommendation. International Journal of Interactive Technology and
Smart Education, 2(3):179–188, 2005.
11. S. Middleton, N. Shadbolt, and D. De Roure. Ontological user profiling in recommender systems.
ACM Transactions on Information Systems, 22(1):54–88, 2004.
12. C. Romero Morales, A. R. Porras Pérez, S. Ventura Soto, C. Hervás Martínez, and A. Zafra
Gómez. Using sequential pattern mining for links recommendation in adaptive hypermedia
educational systems. In Current Developments in Technology-Assisted Education. Sevilla, Spain,
2006, pp. 1016–1020.
13. M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
14. J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART
Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs,
NJ, 1971, pp. 313–323.
15. C. Romero, S. Ventura, and P. De Bra. Knowledge discovery with genetic programming
for providing feedback to courseware authors. User Modeling and User-Adapted Interaction,
14(5):425–464, 2004.
16. C. Romero, S. Ventura, J. A. Delgado, and P. De Bra. Personalized links recommendation
based on data mining in adaptive educational hypermedia systems. In Creating New Learning
Experiences on a Global Scale, volume 4753. LNCS, Crete, Greece, 2007, pp. 292–306.
17. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications
of the ACM, 18:613–620, 1975.
18. L. Shen and R. Shen. Learning content recommendation service based-on simple sequencing
specification. In Advances in Web-Based Learning ICWL 2004, volume 3143. LNCS, Beijing, China,
August, 2004, pp. 363–370.
19. L. Talavera and E. Gaudioso. Mining student data to characterize similar behavior groups in
unstructured collaboration spaces. In Workshop on AI in CSCL. Valencia, Spain, 2004, pp. 17–23.
20. T. Tang and G. McCalla. Smart recommendation for an evolving E-learning system: Architecture
and experiment. International Journal on E-Learning, 4(1):105–129, 2005.
21. K. Thompson and P. Langley. Concept formation in structured domains. In D. Fisher, M. Pazzani,
and P. Langley, editors, Concept Formation: Knowledge and Experience in Unsupervised Learning.
Morgan Kaufmann, San Francisco, CA, 1991, pp. 127–161.
22. F-H. Wang and H-M. Shao. Effective personalized recommendation based on time-framed nav-
igation clustering and association mining. Expert Systems with Applications, 27(3):365–377, 2004.
23. O. R. Zaïane. Building a recommender agent for e-learning systems. In Proceedings of the 7th
International Conference on Computers in Education (ICCE’02). AACE, December 3–6, 2002, p. 55.
20
Log-Based Assessment of Motivation
in Online Learning
Contents
20.1 Introduction......................................................................................................................... 287
20.2 Motivation Measurement in Computer-Based Learning Configurations.................. 288
20.3 The Study............................................................................................................................. 289
20.3.1 The Learning Environments................................................................................. 289
20.3.2 Population................................................................................................................ 290
20.3.3 Log File Description............................................................................................... 290
20.3.4 Learnograms........................................................................................................... 290
20.3.5 Process...................................................................................................................... 290
20.3.6 Results...................................................................................................................... 291
20.3.6.1 Phase I—Constructing a Theory-Based Definition............................ 291
20.3.6.2 Phase II—Identifying Learning Variables............................................ 292
20.3.6.3 Phase III—Clustering the Variables Empirically................................ 293
20.3.6.4 Phase IV—Associating the Empirical Clusters with the
Theory-Based Definition......................................................................... 293
20.4 Discussion............................................................................................................................ 294
References...................................................................................................................................... 295
20.1 Introduction
It has been suggested that affective aspects play a critical role in effective learning and that
they are an important factor in explaining individual differences between learners’ behav-
ior and educational outcomes [1,2]. Emotional and/or affective states (e.g., motivation, anx-
iety, boredom, frustration, self-efficacy, and enthusiasm) are sometimes easily noticed in
the classroom (e.g., by facial expressions), but they are hard to measure and evaluate. This
becomes even harder when we try to assess affective aspects of learning in Web-based
or computer-based learning environments, which lack face-to-face interaction. However,
the traces constituted by students’ log file records offer us new possibilities to meet this
challenge. This chapter presents the development of a student motivation measuring tool
for Web-based or computer-based learning environments, where data is taken solely from
log files. This feature of our research is part of the emerging field of educational data
mining, which focuses on the use of data mining tools and techniques in order to answer
education-related questions.
expressions, skin conductivity) [14–16]; (4) analysis of the learner’s interaction with the
system, including dialogue analysis [17,18] and usage analysis [19–21].
Students’ motivation, which is part of their emotional and affective behavior, has been
suggested as one of the factors explaining individual differences in intensity and direction
of behavior [22]. While it is not easy to assess motivation in face-to-face learning situations,
it becomes even more challenging in computer-based learning, where there is no direct,
eye-to-eye contact between the instructor and the students. Therefore, developing ways to
detect motivation is especially challenging in such environments [23].
It is generally accepted that motivation is “an internal state or condition that serves
to activate or energize behavior and give it direction” [24]; sources of motivation can be
either internal (e.g., interestingness, enjoyment) or external (e.g., wishing for high grades,
fear of parental sanctions) to the person [25]. Motivational patterns, in addition to ability,
may influence the way people learn: whether they seek or avoid challenges, whether they
persist or withdraw when they encounter difficulties, or whether they use and develop
their skills effectively [26]. It has been shown that different motivational patterns relate
to different aspects of the learning process, e.g., achievement goals (performance or mas-
tery), time spent on tasks, performance [27–32]. Previous research on motivation, based on
learner–computer interaction data, examined variables such as the number of
pages read or tasks performed [21,33,34], response time [20,33,34], time spent on different
parts of the learning environment [21,33,34], speed of performance [9], and correctness (or
quality) of answers [9,20,34]. It is the intention of this study to develop a tool for measuring
motivation based solely on log files.
the different modes of learning, students may mark each word/phrase as “well known,”
“not well known,” or “unknown.” In the memorizing and practicing modes, the system
presents the student only with those words that he or she didn’t mark as “known.”
20.3.2 Population
Log files of 2162 adults who used the online learning system during 1 month (April 2007)
were analyzed. After filtering nonactive students and 0-value cases, the research popula-
tion was reduced to N = 674.
20.3.4 Learnograms
The Learnogram can be illuminated by comparison with the medical clinic, where the elec-
trocardiogram (ECG), for instance, allows the cardiologist to examine a graphical display
of the patient’s heart-related parameters over time, thus allowing him or her to learn about
the patient’s cardiac functions without actually seeing the heart. Since we cannot always
observe the student while he or she is interacting with a computer-based (and particularly
a Web-based) learning environment, we may use a mechanism (i.e., log collection) that
continuously documents and graphically displays their learning-related activity, i.e., the
Learnogram. However, the behavioral variables that can be extracted from log files of learn-
ing systems have so far been little researched in terms of their association with affective
aspects of learning. The process described in this chapter is another step toward bridg-
ing this gap. Learnogram-like representations appear in, e.g., Hwang and Wang [35], and
a case study in which Learnograms are used as a qualitative research tool is described in
Nachmias and Hershkovitz [36].
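In essence, a Learnogram is a time series of a log-derived variable rendered per day (or per session); the study produced such plots in MATLAB, and the following Python sketch is only a generic illustration of the idea.

import matplotlib.pyplot as plt

def plot_learnogram(daily_values, variable_name):
    # Render one learnogram: a variable extracted from the logs, plotted day by day
    days = range(1, len(daily_values) + 1)
    plt.step(days, daily_values, where="mid")
    plt.xlabel("Day of the course")
    plt.ylabel(variable_name)
    plt.title("Learnogram: " + variable_name)
    plt.show()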
20.3.5 Process
The motivation measuring tool that this chapter presents was developed as part of a more
general study of affective states while learning in a Web-based or computer-based envi-
ronment. The framework consists of four consecutive phases. The first phase includes an
explicit and operational definition of the affective features in question; eventually, this
definition will be assessed in view of the empirical results. Next, empirical data will be col-
lected, reflecting students’ activity in the learning environment examined. These data are
to be analyzed qualitatively (during the second phase), in order to find relevant variables
to measure the affective state, and then quantitatively (in the third phase), for clustering
according to similarity over a large population. Finally, Phase IV associates the empirical
clusters with the theory-based definition. The result of this is a set of variables, whose
computation is based solely on the log files; at this stage, we also relate these variables to
the theoretical conceptualization. Below is a description of these phases:
File analysis, learnograms, and learning variable computations were all done using
MATLAB®. Clustering analysis was done using SPSS.
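For readers without SPSS, the Phase III step (hierarchically clustering the variables by their similarity across students) can be approximated with the following SciPy-based sketch; the distance measure and linkage method here are assumptions, not necessarily those used in the study.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

def cluster_variables(student_by_variable, variable_names):
    # `student_by_variable` is an (n_students x n_variables) array of the computed variables
    corr = np.corrcoef(student_by_variable, rowvar=False)  # variable-by-variable correlation
    dist = 1.0 - np.abs(corr)                              # similarity -> distance
    condensed = dist[np.triu_indices_from(dist, k=1)]      # condensed distance vector
    links = linkage(condensed, method="average")
    dendrogram(links, labels=variable_names)
    plt.show()
    return links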
20.3.6 Results
The four-phase framework described in Section 20.3.5 was implemented in the online
learning environment investigated, in order to develop a log-based motivation measuring
tool. Following is the description of each of the phases.
what we have in mind is a more general notion; (2) direction—which refers to the way
motivation is preserved and oriented; (3) source of motivation (internal or external). Here
it is important to point out that although the variables by which motivation is measured
might be (almost) continuously evaluated, motivation—as the sum of many parameters—
should be measured over a period of time. Hence, engagement is considered an average
intensity, direction should describe the overall trend of the engagement level (e.g., increas-
ing, decreasing, stable, frequently changing), and source indicates the motivation’s ten-
dency to be either internal or external.
FIGURE 20.1
Four Learnograms for one student representing behavior over 65 days of activity in the researched learning
environments. The variables presented (from bottom to top): perceived knowledge, presence, pace, learning
modes.
TABLE 20.1
The Variables Defined in Phase II

timeOnTaskPC: total time of active sessions [min] divided by total time of logged data (unit: %)
avgSession: average session duration (unit: min)
avgActPace: average pace of activity within sessions; the pace of activity per session is the number of actions divided by the session duration (unit: actions/min)
avgBtwnSessions: average time between sessions (unit: min)
wordMarkPace: pace of word marking; the change in the number of known words from beginning to end (can be negative) divided by total time of logged data (unit: words/min)
examPC: percentage of exam-related activity; the number of exam actions divided by the total number of actions (unit: %)
gamePC: percentage of game-related activity; the number of game actions divided by the total number of actions (unit: %)
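As an indication of how such variables might be derived from raw logs, the sketch below computes three of them from a list of session records; the field names and the session schema are hypothetical, since the chapter does not specify the log format.

def compute_log_variables(sessions, total_logged_minutes):
    # `sessions` is a list of dicts with hypothetical fields 'duration_min' and 'n_actions'
    n = len(sessions) or 1
    active_minutes = sum(s["duration_min"] for s in sessions)
    time_on_task_pc = 100.0 * active_minutes / max(total_logged_minutes, 1e-9)  # timeOnTaskPC [%]
    avg_session = active_minutes / n                                            # avgSession [min]
    avg_act_pace = sum(s["n_actions"] / max(s["duration_min"], 1e-9)
                       for s in sessions) / n                                   # avgActPace [actions/min]
    return {"timeOnTaskPC": time_on_task_pc,
            "avgSession": avg_session,
            "avgActPace": avg_act_pace}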
20.3.6.4 Phase IV—Associating the Empirical Clusters with the Theory-Based Definition
In our research, this phase is currently based only on literature review, and has not yet
been validated empirically.

FIGURE 20.2
Dendrogram of the hierarchical clustering process (variables clustered: timeOnTaskPC, avgSession, examPC, gamePC, avgBtwnSessions, avgActPace, wordMarkPace).

The variables timeOnTask and avgSession, which form the first cluster, might be related to the extent of engagement, as it was previously suggested
that working time might be a measure for attention or engagement [21,26].
The variables, examPC and gamePC, grouped together in the second cluster, reflect
the student’s source of motivation; it may be reasonable to hypothesize—inspired by, e.g.,
[37,38]—that students who tend to take self-exams frequently (related to performance goal
orientation) have extrinsic motivation to learn, while those who tend to use game appli-
cations (related to learning goal orientation) are intrinsically motivated. The variables,
avgActPace and avgBtwnSessions, are also clustered together with the previous two, but
their closeness to Source of motivation is yet to be established.
The variable, wordMarkPace, indicating students’ word-marking speed, forms the third
cluster. According to a diagnostic rule found in de Vicente and Pain [9], high speed of activ-
ity together with high quality of performance (when staying with exercises of similar diffi-
culty) suggests increasing motivation. Since an increase in the number of words marked
is an indication of the student’s perceived knowledge (i.e., a reflection of performance),
wordMarkPace might be related to the direction of motivation, i.e., direction.
20.4 Discussion
Many modes of delivery of online and computer-based learning exist (e.g., intelligent tutoring systems, educational software, virtual courses, Web-supported courses, and electronic books); all are based mainly on the interaction between student and system. While using such an environment, learners leave continuous hidden traces of their activity in the form of log file records, which document every action taken. These log files hold data that can be analyzed to offer a better understanding of the learning process. Using these traces to investigate learners' behavior has great potential for serving different aspects of educational research [39,40]. Our goal in this chapter was to suggest an empirically constructed tool for the log-based measurement of motivation.
Many years of educational research have shown that cognitive skills crucially influence learning outcomes and achievements; however, the affective aspects of learning, which strongly shape this influence, must also be understood and assessed. Such assessment becomes a great challenge in learning situations that do not include face-to-face interactions between instructor and learner. Being able to measure affect-related
variables in such environments is of great importance for the various parties involved: (a) the instructors, who might identify and consequently address irregularities in their students' affective state; (b) learning system developers, who might integrate assessment and intervention mechanisms in order to better fit the individual student's learning needs; and (c) researchers, who might extend the existing models of affective aspects of learning on the basis of empirical evidence. Overall, the main beneficiary is the learner, whose effective learning constitutes the very heart of this process.
Examining motivational aspects of learning in e-learning systems that hold a large number of students has great potential, since it may tap otherwise unrecognized phenomena (e.g., a constant decrease in students' motivation in certain situations, or high levels of anxiety associated with certain topics or courses). Validating the presented results and scaling the suggested variables are crucial steps before the development of our motivation-measuring tool can be completed. The proper way of doing this is external validation, i.e., finding the association between the variables found and independent variables measured by an external measuring tool for motivation. It is also possible to carry out the validation step in a different learning environment; however, in this case, a few preliminary steps are required, in particular a replication of the clustering process, in order to verify that the new system preserves the found clusters.
Furthermore, two major limitations should be considered. First, the variables were identified in a specific learning environment; the measuring tool might therefore be useful for similar systems, but when it is used in different environments (in terms of, e.g., learning domain or available instruction modes), these variables should be converted and their clustering re-examined. Second, the tool might not be complete: we focused on only seven variables, but others might be considered. Identifying these variables from a segment of the learning period makes it possible to employ the tool during the learning process; in this way, intervention when needed becomes possible, and changes in motivation can be analyzed.
Data-mining-driven research on the affective aspects of learning is only in its first stages, and we feel that a long way still lies ahead. Among the many difficulties, one of the major challenges is validating the measuring tools and establishing the different ways of doing so. Developing automatic log-based measuring algorithms for the different aspects of the learning process is a core aim of our EduMining research group (http://edumining.info), and we always seek new partners for brainstorming and further research.
References
1. Craig, S., Graesser, A., Sullins, J., and Gholson, B. (2004). Affect and learning: An exploratory look
into the role of affect in learning with AutoTutor. Learning, Media and Technology, 29(3), 241–250.
2. Kort, B., Reilly, R., and Picard, R. (2001). An affective model of the interplay between emotions
and learning. In Proceedings of IEEE International Conference on Advanced Learning Technologies.
Madison, WI, August 6–8, 2001.
3. Keller, J.M. (1983). Motivational design of instruction. In Instructional-Design Theories and
Models: An Overview of Their Current Status, C.M. Reigeluth (Ed.), pp. 383–434. Lawrence
Erlbaum Associates, Inc., Hillsdale, NJ.
4. Malone, T.W. and Lepper, M.R. (1987). Making learning fun: A taxonomy of intrinsic moti-
vations for learning. In Conative and Affective Process Analyses, Vol. 3. Aptitude, Learning, and
Instruction, R.E. Snow and M.J. Farr (Eds.), pp. 223–253. Lawrence Erlbaum Associates, Inc.,
Hillsdale, NJ.
5. Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper and Row,
New York.
6. Stein, N.L. and Levine, L.J. (1991). Making sense out of emotion. In Memories, Thoughts, and
Emotions: Essays in Honor of George Mandler, W. Kessen, A. Ortony, and F. Kraik (Eds.), pp. 295–322.
Lawrence Erlbaum Associates, Hillsdale, NJ.
7. del Soldato, T. and du Boulay, B. (1995). Implementation of motivational tactics in tutoring
systems. Journal of Artificial Intelligence in Education, 6(4), 337–376.
8. O’Regan, K. (2003). Emotion and e-learning. Journal of Educational Computing Research, 7(3),
78–91.
9. de Vicente, A. and Pain, H. (2002). Informing the detection of the students’ motivational state:
An empirical study. Proceedings of the Sixth International Conference on Intelligent Tutoring Systems
(ITS 2002). Biarritz, France, June 5–8, 2002.
10. Baker, R.S.J.d., Corbett, A.T., Koedinger, K.R., and Wagner, A.Z. (2004). Off-task behavior in the
cognitive tutor classroom: When students “game the system.” Proceedings of SIGCHI Conference
on Human Factors in Computing Systems. Vienna, Austria, April 24–29, 2004.
11. Batliner, A., Steidl, S., Hacker, C., and Nöth, E. (2008). Private emotions versus social interac-
tion: A data-driven approach towards analysing emotion in speech. User Modeling and User-
Adapted Interaction, 18(1–2), 175–206.
12. Campbell, N. (2006). A language-resources approach to emotion: Corpora for the analysis of
expressive speech. In Proceedings of a Satellite Workshop of the International Conference on Language
Resources and Evaluation (LREC 2006) on Corpora for Research on Emotion and Affect. Genoa, Italy,
May 25, 2006.
13. Nkambou, R. (2006). Towards affective intelligent tutoring system. In Proceedings of Workshop on
Motivational and Affective Issues in ITS, in the Eighth International Conference on Intelligent Tutoring
Systems. Jhongli, Taiwan, June 27, 2006.
14. Burleson, W. (2006). Affective learning companions: Strategies for empathetic agents with real-
time multimodal affective sensing to foster meta-cognitive and meta-affective approaches to
learning, motivation, and perseverance. PhD dissertation. MIT Media Lab, Boston, MA.
15. D’Mello, S.K., Craig, S.D., Gholson, B., Franklin, S., Picard, R., and Graesser, A.C. (2005).
Integrating affect sensors in an intelligent tutoring system. Proceedings of Affective Interactions:
The Computer in the Affective Loop, Workshop in International Conference on Intelligent User Interfaces.
San Diego, CA.
16. McQuiggan, S.W., Mott, B.W., and Lester, J.C. (2008). Modeling self-efficacy in intelligent
tutoring systems: An inductive approach. User Modeling and User-Adapted Interaction, 18(1–2),
81–123.
17. D’Mello, S.K., Craig, S.D., Witherspoon, A., McDaniel, B., and Graesser, A. (2008). Automatic
detection of learner’s affect from conversational cues. User Modeling and User-Adapted Interaction,
18(1–2), 45–80.
18. Porayska-Pomsta, K., Mavrikis, M., and Pain, H. (2008). Diagnosing and acting on student
affect: the tutor’s perspective. User Modeling and User-Adapted Interaction, 18(1–2), 125–173.
19. Baker, R.S.J.d. (2007). Modeling and understanding students’ off-task behavior in intelligent
tutoring systems. In Proceedings of Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems. San Jose, CA, April 14–18, 2007.
20. Beck, J.E. (2004). Using response times to model student disengagement. In Proceedings of
ITS2004 Workshop on Social and Emotional Intelligence in Learning Environments. Maceio, Brazil,
August 31, 2004.
21. Cocea, M. and Weibelzahl, S. (2007). Cross-system validation of engagement prediction from
log files. In Proceedings of Second European Conference on Technology Enhanced Learning (EC-TEL
2007). Crete, Greece, September 17–20, 2007.
22. Humphreys, M.S. and Revelle, W. (1984). Personality, motivation, and performance: A theory
of the relationship between individual differences and information processing. Psychological
Review, 91(2), 153–184.
23. de Vicente, A. and Pain, H. (1998). Motivation diagnosis in intelligent tutoring systems. In
Proceedings of Fourth International Conference on Intelligent Tutoring Systems. San Antonio, TX,
August 16–19, 1998.
24. Kleinginna, P.R. and Kleinginna, A.M. (1981). A categorized list of emotion definitions, with
suggestions for a consensual definition. Motivation and Emotion, 5(4), 345–378.
25. Deci, E.L. and Ryan, R.M. (1985). Intrinsic Motivation and Self-Determination in Human Behavior.
Plenum, New York.
26. Dweck, C.S. (1986). Motivational processes affecting learning. American Psychologist, 41(10),
1040–1048.
27. Ames, C. and Archer, J. (1988). Achievement goals in the classroom: Students’ learning strate-
gies and motivation. Journal of Educational Psychology, 80(3), 260–267.
28. Elliott, E.S. and Dweck, C.S. (1988). Goals: An approach to motivation and achievement. Journal
of Personality and Social Psychology, 54(1), 5–12.
29. Greene, B.A., Miller, R.B., Crowson, H.M., Duke, B.L., and Akey, K.L. (2004). Predicting high
school students’ cognitive engagement and achievement: Contributions of classroom percep-
tions and motivation. Contemporary Educational Psychology, 29(4), 462–482.
30. Masgoret, A.M. and Gardner, R.C. (2003). Attitudes, motivation, and second language learn-
ing: A meta-analysis of studies conducted by Gardner and associates. Language Learning, 23(1),
123–163.
31. Singh, K., Granville, M., and Dika, S. (2002). Mathematics and science achievement: Effects of
motivation, interest, and academic engagement. Journal of Educational Research, 95(6), 323–332.
32. Wong, M.M.-h. and Csikszentmihalyi, M. (1991). Motivation and academic achievement: The
effects of personality traits and the quality of experience. Journal of Personality, 59(3), 539–574.
33. Qu, L. and Johnson, W.L. (2005). Detecting the learner’s motivational states in an interactive
learning environment. In Proceedings of the 12th International Conference on Artificial Intelligence
in Education (AIED’2005). Amsterdam, the Netherlands, July 18–22, 2005.
34. Zhang, G., Cheng, Z., He, A., and Huang, T. (2003). A WWW-based learner’s learning moti-
vation detecting system. In Proceedings of the International Workshop on Research Directions and
Challenge Problems in Advanced Information Systems Engineering, at the First International Conference
on Knowledge Economy and Development of Science and Technology. Honjo City, Japan, September
17, 2003.
35. Hwang, W.-Y. and Wang, C.-Y. (2004). A study of learning time patterns in asynchronous learn-
ing environments. Journal of Computer Assisted Learning, 20(4), 292–304.
36. Nachmias, R. and Hershkovitz, A. (2007). A case study of using visualization for understand-
ing the behavior of the online learner. In Proceedings of the International Workshop on Applying
Data Mining in e-Learning, at the Second European Conference on Technology Enhanced Learning
(EC-TEL’07). Crete, Greece, September 17–20, 2007.
37. Heyman, G.D. and Dweck, C.S. (1992). Achievement goals and intrinsic motivation: Their rela-
tion and their role in adaptive motivation. Motivation and Emotion, 16(3), 231–247.
38. Ryan, R.M. and Deci, E.L. (2000). Intrinsic and extrinsic motivations: Classic definitions and
new directions. Contemporary Educational Psychology, 25(1), 54–67.
39. Castro, F., Vellido, A., Nebot, A., and Mugica, F. (2007). Applying data mining techniques to
e-learning problems. In L.C. Jain, T. Raymond and D. Tedman (Eds.), Evolution of Teaching and
Learning Paradigms in Intelligent Environment, Vol. 62, pp. 183–221. Springer-Verlag, Berlin,
Germany.
40. Romero, C. and Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert
Systems with Applications, 33(1), 135–146.
21
Mining Student Discussions for Profiling
Participation and Scaffolding Learning
Contents
21.1 Introduction......................................................................................................................... 299
21.2 Developing Scaffolding Capability: Mining Useful Information from
Past Discussions..................................................................................................................300
21.2.1 Step 1: Discussion Corpus Processing................................................................. 301
21.2.2 Step 2: Technical Term Processing....................................................................... 301
21.2.3 Step 3: Term Vector Generation............................................................................ 302
21.2.4 Step 4: Term Weight Computation....................................................................... 302
21.2.5 Step 5: Similarity Computation and Result Generation.................................... 302
21.2.6 Step 6: Evaluation of System Responses.............................................................. 303
21.3 Profiling Student Participation with Gender Data and Speech Act Classifiers.........304
21.3.1 Speech Act Classifiers............................................................................................304
21.3.2 Gender Classifier/Distribution............................................................................. 306
21.3.3 An Application of Gender Classifier/Distribution............................................ 307
21.4 Related Work.......................................................................................................................308
21.5 Summary and Discussion................................................................................................. 309
References......................................................................................................................................309
21.1 Introduction
Online discussion boards play an important role in distance education and Web-enhanced
courses. Recent studies have pointed to online discussion boards as a promising strategy
for promoting collaborative problem-solving courses and discovery-oriented activities
[1,2]. However, other research indicates that existing systems for online discussion may not
always be fully effective in promoting learning in undergraduate courses. For example,
some analyses of collaborative online learning indicate that student participation is low
or weak, even when students are encouraged to participate [3,4]. As course enrollments
increase, with some introductory courses enrolling several hundred students, the heavier
online interaction can place a considerable burden on instructors and teaching assistants.
We are developing instructional tools that can automatically assess student participation
and promote interactions.
In this chapter, we present two novel tools that apply data mining and information
retrieval techniques. First, we describe an approach that scaffolds undergraduate stu-
dent discussions by retrieving useful information from past student discussions. We first
semiautomatically extract domain terms from textbooks specified for the courses, and use
them in modeling individual messages with term vectors. We then apply term frequency
and inverse document frequency (TF-IDF) [5] in retrieving useful information from past
student discussions. The tool exploits both the discussions from the same undergraduate
course and the ones from a graduate-level course that share similar topics. Our hypothesis
is that since graduate discussions are full of rich, elucidating dialogues, those conversa-
tions can be recommended as interesting references for undergraduates. We also hypoth-
esize that we can scaffold discussions by sending messages from past students who had
similar assignments or problems regarding course topics. We analyze the usefulness of the
retrieved information with respect to relevance and technical quality.
The second section of the chapter presents an instructional tool that profiles student
contributions with respect to student genders and the roles that they play in discussions.
The discussion threads are viewed as a special case of human conversation, and roles of
a message with respect to its previous message are described with respect to speech acts
(SAs) such as question, answer, elaboration, and/or correction [6]. We apply two SA classifiers:
question classifier and answer classifier [6]. The question classifier identifies messages that
play a role of asking questions, and the answer classifier detects messages with answers
in response to a previous message. We use the classification results in profiling male and
female student contributions.
We performed an initial evaluation of these tools in the context of an undergraduate
operating systems course offered by the computer science department at the University of
Southern California (USC). The current results for the scaffolding tool indicate that past
discussions from the same course contain many similar concepts that we can use in guid-
ing the discussions. Although graduate-level discussions did not contain many similar
topics, their technical quality was higher. The initial results from the profiling tool show
that female participation in undergraduate-level discussions is lower than that in grad-
uate-level discussions, and graduate female students post more questions and answers
compared to undergraduate female students.
FIGURE 21.1
Mining relevant or similar messages from past discussions. (Pipeline: a question message Q and each message Mi from the UG, I, and G discussion corpora are represented as term vectors Q(Tq1, ..., TqN) and Mi(Ti1, ..., TiN) over the technical terms drawn from the textbook glossaries of both courses; term weights are computed to give weight vectors Q(Wq1, ..., WqN) and Mi(Wi1, ..., WiN); similarity scores sim(Q, Mi) are calculated; and messages are returned ranked by similarity score.)
where
N is the total number of technical terms in the domain
Tij = 0 if a term is missing in that message
TABLE 21.1
TF-IDF Feature Weight Computation

W_ik = TF_ik · log(N / n_k)    (21.1)

where
W_ik = TF-IDF weight for term k in document (message) Mi
TF_ik = frequency of term k in document (message) Mi
IDF_k = inverse document frequency of term k in the discussion corpus = log(N / n_k)
N = total number of documents (messages) in the discussion corpus
n_k = number of messages in the discussion corpus that contain the term k
TABLE 21.2
Cosine Similarity Computation (between Document/Message Mi and Question Message Q)

sim(Q, Mi) = [ Σ_{k=1..n} w_qk · w_ik ] / sqrt[ Σ_{k=1..n} (w_qk)² · Σ_{k=1..n} (w_ik)² ]    (21.2)

where
Q = (W_q1, W_q2, W_q3, ..., W_qk, ..., W_qn) = feature-weight vector representing question message Q
Mi = (W_i1, W_i2, W_i3, ..., W_ik, ..., W_in) = feature-weight vector representing message Mi from the discussion corpus
sim(Q, Mi) = similarity score between messages Q and Mi
w_qk = TF-IDF weight for feature term k in question message Q
w_ik = TF-IDF weight for feature term k in message Mi
there is some amount of technical content in the message, we retrieve the messages with
at least three technical terms. The results can be sent as a response (Step 5 in Figure 21.1).
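Equations 21.1 and 21.2 amount to standard TF-IDF weighting followed by cosine similarity over the technical-term vocabulary. The sketch below shows one way to implement the retrieval step, assuming messages have already been reduced to lists of technical terms; it is an illustration, not the authors' implementation.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_vectors(messages: List[List[str]]) -> List[Dict[str, float]]:
    """Equation 21.1: w_ik = TF_ik * log(N / n_k) over technical terms."""
    N = len(messages)
    df = Counter(term for msg in messages for term in set(msg))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(msg).items()}
            for msg in messages]

def cosine(q: Dict[str, float], m: Dict[str, float]) -> float:
    """Equation 21.2: cosine similarity between sparse weight vectors."""
    num = sum(w * m.get(t, 0.0) for t, w in q.items())
    den = math.sqrt(sum(w * w for w in q.values()) *
                    sum(w * w for w in m.values()))
    return num / den if den else 0.0

def retrieve(question_terms: List[str],
             corpus_terms: List[List[str]], top_k: int = 3):
    """Rank past messages (with at least three technical terms) against Q."""
    vecs = tfidf_vectors(corpus_terms + [question_terms])
    q_vec, msg_vecs = vecs[-1], vecs[:-1]
    ranked = sorted(((cosine(q_vec, v), i) for i, v in enumerate(msg_vecs)
                     if len(set(corpus_terms[i])) >= 3), reverse=True)
    return ranked[:top_k]

print(retrieve(["semaphore", "deadlock"],
               [["semaphore", "mutex", "deadlock", "thread"],
                ["scheduler", "quantum"],
                ["deadlock", "semaphore", "starvation"]]))
```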
FIGURE 21.2
Evaluation results by human judges for technical quality (left, rated 1–5) and usefulness/relevance (right) of responses drawn from undergraduate student messages, undergraduate instructor messages, and graduate student messages.
messages were rated lower than others. The number of messages retrieved from G was smaller; however, their technical quality seems a little better than that of the other two (3.31).
Figure 21.2 (right) shows the average ratings of relevance. In this case, the evaluators
assessed the relevance of system responses to the given message. The evaluators checked
“how useful the message is as a response” or “how related the message was to the given
message.” The UG results returned by the system tend to be less relevant (average rating
~2.52) to the given message than the ones from I (average rating ~2.67). The average rating
for responses from G is the lowest (~1.74). Even though the graduate student messages tend
to be more technical than the undergraduate student messages, they might not be fully
relevant to the given undergraduate student message. Although there are some common
topics between the undergraduate and the graduate courses, the kinds of problems that
they are discussing seem different. In addition, we had a smaller set of data for G, which
made it harder for the system to extract relevant results.
is given the label “elaboration.” Other SA categories include question, answer, correction,
acknowledgment, etc.
We grouped the SAs to create two binary classifiers, one that identifies questions and the other that identifies answers in question-and-answer-type discussion threads [6].
Using the question classifier, we identified the question messages posted by students from
among all the messages in the corpus. The answer classifier was used similarly to iden-
tify the messages from the same corpus that contained answers or suggestions. Questions
and answers occasionally appear in the same posts so that a particular message may be
classified as both a question and an answer, that is, the identifications are not mutually
exclusive.
The training set consisted of 1010 messages. The test set had 824 messages. Besides typi-
cal data preprocessing and cleaning steps taken by many natural language processing
(NLP) systems, such as stemming and filtering, our system performs additional steps for
removing noise and reducing variances.
We first remove the text from previous messages that is automatically inserted by the
discussion board system when the user clicks on a reply-to button. We also apply a sim-
ple stemming algorithm that removes “s” and “es” for plurals. Since the corpus contains
mostly technical discussions, it comprises many computer science concepts and terms
including programming code fragments. Each section of programming code or code frag-
ment is replaced with a single term called code. We then use a transformation algorithm
that replaces common words or word sequences with special category names. For exam-
ple, many pronouns like “I,” “we,” and “you” are replaced by the symbol categ_person and
sequences of numbers by categ_number_seq. For “which,” “where,” “when,” “who,” and
“how,” we used the term categ_w_h. Similar substitution patterns were used for a number
of categories like filetype extensions (“.html,” “.c,” “.c++,” “.doc”), URL links, and others.
Students also tend to use informal words (e.g., “ya,” “yeah,” “yup”) and typographical
symbols such as smiley faces as acknowledgment, support, or compliment. We transform
such words into consistent words or symbols. We also substitute words like ‘re, ‘m, ‘ve,
don't, etc., with "are," "am," "have," "do not," etc. Finally, since SA classification tends to rely more on surface word patterns than on the technical terms used, technical terms occurring in the messages were replaced by a single word called tech_term.
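A rough sketch of these normalization steps is shown below, using regular expressions. The exact patterns and category symbols of the original system are not given in the chapter, so the ones here are illustrative stand-ins.

```python
import re

def normalize(text: str) -> str:
    """Approximate the preprocessing described in Section 21.3.1 (a sketch)."""
    text = text.lower()
    # expand common contractions
    for pat, rep in [(r"n't\b", " not"), (r"'re\b", " are"),
                     (r"'m\b", " am"), (r"'ve\b", " have")]:
        text = re.sub(pat, rep, text)
    # collapse code-like fragments, URLs, file extensions, and number sequences
    text = re.sub(r"\{.*?\}", " code ", text, flags=re.S)
    text = re.sub(r"https?://\S+", " categ_url ", text)
    text = re.sub(r"\b\w+\.(html|c|cpp|doc|java)\b", " categ_filetype ", text)
    text = re.sub(r"\b\d+(\s+\d+)+\b", " categ_number_seq ", text)
    # replace pronouns and wh-words with category symbols
    text = re.sub(r"\b(i|we|you)\b", " categ_person ", text)
    text = re.sub(r"\b(which|where|when|who|how)\b", " categ_w_h ", text)
    # naive plural stripping ("s"/"es"), as in the chapter's simple stemmer
    tokens = [re.sub(r"(es|s)$", "", tok) if len(tok) > 3 else tok
              for tok in text.split()]
    return " ".join(tokens)

print(normalize("Which file do I upload, main.c or main.doc? I haven't tried it yet"))
```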
For our classifiers, we use N-gram features that represent all possible sequences of N
terms. That is, unigrams (single-word features), bigrams (sequence of two words), trigrams
(sequence of three words), and quadrograms (sequence of four words) are used for train-
ing and building the classifier models. There were around 5000 unigrams or unique words
occurring in the training corpus. Since the data was very noisy and incoherent, the feature
space is larger and contains a lot of extraneous features.
We use information gain theory for pruning the feature space and selecting features
from the whole set [10]. The information gain value for a particular feature gives a measure of the information gained in classification prediction, that is, how much the absence or presence of the feature may affect the classification. First, we compute the informa-
tion gain values for different N-gram features extracted from the training data (Equation
21.3 in Table 21.3). For each feature, we compute two values, one for the question classifier
(called QC) and the other for the answer classifier (called AC). Subsequently, all the fea-
tures (unigrams, bigrams, trigrams, and quadrograms) are sorted based on the informa-
tion gain values. We use the top 200 features for each classifier. Some of the top N-gram
features for QC and AC are shown in Table 21.4.
We use a linear support vector machine (SVM) implementation [11] to learn SA classi-
fiers from the training data. Linear SVM is an efficient machine learning technique used
TABLE 21.3
Information Gain Computation for Features in Speech Act
Classification
where
k = feature term k present in the message
k̄ = feature term k not present in the message
C = message belongs to class C (e.g., question)
C̄ = message does not belong to class C (e.g., not a question)
G(k) = information gain corresponding to feature k
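The body of Equation 21.3 is not reproduced here; the sketch below implements the standard binary information gain measure from Yang and Pedersen [10], G(k) = H(C) - H(C | k), under the assumption that this is the quantity Table 21.3 defines.

```python
import math
from typing import Dict, List

def entropy_term(p: float) -> float:
    """Contribution -p*log(p), with the convention 0*log(0) = 0."""
    return -p * math.log(p) if p > 0 else 0.0

def info_gain(has_feature: List[bool], in_class: List[bool]) -> float:
    """G(k) = H(C) - H(C | feature k), for one feature over all messages."""
    n = len(has_feature)
    p_c = sum(in_class) / n
    g = entropy_term(p_c) + entropy_term(1 - p_c)      # H(C)
    for present in (True, False):
        idx = [i for i in range(n) if has_feature[i] == present]
        if not idx:
            continue
        p_k = len(idx) / n
        p_c_k = sum(in_class[i] for i in idx) / len(idx)
        g -= p_k * (entropy_term(p_c_k) + entropy_term(1 - p_c_k))
    return g

def select_top_features(feature_matrix: Dict[str, List[bool]],
                        labels: List[bool], top_n: int = 200) -> List[str]:
    """Rank candidate n-gram features and keep the top 200, as in the chapter."""
    scores = {f: info_gain(col, labels) for f, col in feature_matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```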
TABLE 21.4
Top N-Grams Based on Information Gain
Category 1-gram 2-gram 3-gram 4-gram
QUES ? do [categ_person] [categ_w_h] should do [categ_person] have to
[categ_w_h] [tech_term] ? [categ_person] do [categ_person] need to
will can [categ_person] was [tech_term] [tech_term]
do [categ_person] wondering [tech_term] ?
confused is there [or/and] do [categ_person] is there a better
? thanks is there a does this mean that
the [tech_term] ?
ANS Yes look at look at the [categ_person] am a
am [or/and] do for example, [tech_term]
helps seems like . [categ_person] should do [categ_person] have to
but in [tech_term] let [me/him/her/us] know look at the [tech_term]
depends stated in not seem to in the same [tech_term]
[tech_term] is not
[tech_term]
often in text classification and categorization problems. We constructed feature vectors for
all the messages in the corpus. A feature vector of a message consisted of a list of values
for individual features that represent whether the features existed in the message or not.
We perform fivefold cross-validation experiments on the training data to set kernel param-
eter values for the linear SVM. After we ran SVM, the resulting classifiers (QC and AC)
were then used to predict the SAs for the feature vectors in the test set. QC tells whether a
particular message contains question content or not, and AC predicts whether a message
contains answer content, that is, answers or suggestions. The classification was then com-
pared with the human annotations. The resulting QC and AC had accuracies of 88% and
73%, respectively [6].
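The training flow described above can be sketched as follows. The original work used LIBSVM [11]; scikit-learn's LinearSVC is used here only as a stand-in, and the toy messages and labels are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# hypothetical normalized messages and binary labels (1 = contains a question)
train_msgs = ["categ_w_h do categ_person submit the tech_term ?",
              "look at the tech_term , it helps"] * 50
is_question = [1, 0] * 50

# n-gram presence features (unigrams through quadrograms), as in the chapter
vectorizer = CountVectorizer(ngram_range=(1, 4), binary=True)
X = vectorizer.fit_transform(train_msgs)

qc = LinearSVC()  # question classifier; the answer classifier is built the same way
scores = cross_val_score(qc, X, is_question, cv=5)
print("fivefold CV accuracy:", scores.mean())

qc.fit(X, is_question)
# predict on new (already normalized) messages
print(qc.predict(vectorizer.transform(["is there a better tech_term ?"])))
```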
TABLE 21.5
Male/Female Student Distribution (Registered/Participating) in the Two
Courses
Course Students Males (%) Females (%)
Undergraduate Among total students registered 89 11
Among students posting messages 91.3 8.7
(participating in discussions)
Graduate Among total students registered 86.4 13.6
Among students posting messages 87.5 12.5
(participating in discussions)
TABLE 21.6
Male/Female Student Participation by Message Type
Course Message Type Males (%) Females (%)
Undergraduate Question messages 97 3
Answer messages 96 4
Graduate Question messages 78 22
Answer messages 88 12
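Tables 21.5 and 21.6 follow from joining the QC/AC output with roster gender data and aggregating. A minimal sketch over hypothetical per-message records might look like this:

```python
from collections import Counter

# hypothetical per-message records: (author_gender, predicted_question, predicted_answer)
messages = [("M", True, False), ("F", False, True), ("M", True, True),
            ("M", False, True), ("F", True, False)]

def percent_by_gender(records, selector):
    """Percentage of the selected messages posted by male vs. female students."""
    counts = Counter(g for g, q, a in records if selector(q, a))
    total = sum(counts.values()) or 1
    return {g: 100.0 * c / total for g, c in counts.items()}

print("postings:", percent_by_gender(messages, lambda q, a: True))
print("questions:", percent_by_gender(messages, lambda q, a: q))
print("answers:", percent_by_gender(messages, lambda q, a: a))
```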
FIGURE 21.3
Comparison of participation by gender (total vs. female students) in the undergraduate and graduate courses, shown for postings, questions, and answers.
contribute to a better learning environment for students. Related work on dialogue model-
ing investigates different ways to manage continuous dialogue for personal assistants [14]
or scaffold conversations in a virtual meeting space [15]. In contrast, we focus on optimiz-
ing a one-step question response by mining knowledge from archived materials asynchro-
nously. More closely related, some systems can generate help-desk responses using clustering techniques [16], but their corpus is composed of two-party, two-turn conversation pairs, which makes it easier to bypass the complex analysis of discussions among multiple parties.
In the area of online learning, much attention has been paid to the analysis of student
learning behaviors in online communications. Various frameworks have been proposed
for characterizing and analyzing computer-mediated communications in the context of
collaborative discussions [17], e-mail and chat exchanges [18], knowledge sharing [19], and
general argumentation [20,21], but none have been sufficiently relevant or fine grained to
facilitate data mining and answer extraction in threaded discussions.
References
1. Scardamalia, M. and Bereiter, C., Computer support for knowledge building communities. In:
T. Koschmann (Ed.), CSCL: Theory and Practice of an Emerging Paradigm. Mahwah, NJ: Erlbaum,
1996.
2. Koschmann, T. (Ed.), CSCL: Theory and Practice of an Emerging Paradigm. Hillsdale, NJ: Lawrence
Erlbaum, 1996.
3. Pallof, R.M. and Pratt, K., Building Learning Communities in Cyberspace: Effective Strategies for the
Online Classroom. San Francisco, CA: Jossey-Bass, 1999.
4. Kim, J. and Beal, C., Turning quantity into quality: Supporting automatic assessment of on-line
discussion contributions. American Educational Research Association (AERA) Annual Meeting, San
Francisco, CA, 2006.
5. Salton, G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by
Computer. Reading, MA: Addison-Wesley, 1989.
6. Ravi, S. and Kim, J., Profiling student interactions in threaded discussions with speech act clas-
sifiers. Proceedings of AI in Education Conference, Los Angeles, CA, 2007.
7. Feng, D., Shaw, E., Kim, J., and Hovy, E.H., An intelligent discussion-bot for answering stu-
dent queries in threaded discussions. Proceedings of International Conference on Intelligent User
Interfaces, Sydney, Australia, pp. 171–177, 2006.
8. Austin, J., How to Do Things with Words. Cambridge, MA: Harvard University Press, 1962.
9. Searle, J., Speech Acts. Cambridge, U.K.: Cambridge University Press, 1969.
10. Yang, Y. and Pedersen, J.O., A comparative study on feature selection in text categorization.
Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, pp. 412–420,
1997.
11. Chang, C.-C. and C.-J. Lin, LIBSVM: A library for support vector machines. Journal of Machine
Learning Research, 1, 161–177, 2001.
12. Hermjakob, U., Hovy, E.H., and Lin, C., Knowledge-based question answering. Proceedings of
Text Retrieval Conference, Gaithersburg, Maryland, 2000.
13. Pasca, M. and Harabagiu, S., High performance question/answering. Proceedings of SIGIR,
New Orleans, LA, pp. 366–374, 2001.
14. Nguyen, A. and Wobcke, W., An agent-based approach to dialogue management in personal
assistants. Proceedings of International Conference on Intelligent User Interfaces, San Diego, CA,
pp. 137–144, 2005.
15. Isbister, K., Nakanishi, H., Ishida, T., and Nass, C., Helper agent: Designing an assistant
for human-human interaction in a virtual meeting space. Proceeding of the Computer Human
Interaction Conference, Hayama-machi, Japan, pp. 57–64, 2000.
16. Marom, Y. and Zukerman, I., Corpus-based generation of easy help-desk responses. Technical
Report. School of Computer Science and Software Engineering, Monash University, 2005.
17. Shaw, E., Assessing and scaffolding collaborative learning in online discussion. Proceedings of
Artificial Intelligence in Education Conference, Amsterdam, the Netherlands, pp. 587–594, 2005.
18. Cakir, M., Xhafa, F., Zhou, N., and Stahl, G., Thread-based analysis of patterns of collaborative
interaction in chat. Proceedings of Artificial Intelligence in Education Conference, Amsterdam, the
Netherlands, pp. 716–722, 2005.
19. Soller, A. and Lesgold, A., Computational approach to analyzing online knowledge sharing
interaction. Proceedings of Artificial Intelligence in Education Conference, Sydney Australia, 2003.
20. Feng, D., Kim, J., Shaw, E., and Hovy E., Towards modeling threaded discussions through
ontology-based analysis. Proceedings of National Conference on Artificial Intelligence, Boston, MA,
2006.
21. Painter, C., Coffin, C., and Hewings, A., Impacts of directed tutorial activities in computer con-
ferencing: A case study. Distance Education, 24(2), 159–173, 2003.
22
Analysis of Log Data from a Web-Based
Learning Environment: A Case Study
Judy Sheard
Contents
22.1 Introduction......................................................................................................................... 311
22.2 Context of the Study........................................................................................................... 312
22.3 Study Method...................................................................................................................... 313
22.3.1 Participants.............................................................................................................. 313
22.3.2 Data Collection and Integration........................................................................... 313
22.4 Data Preparation................................................................................................................. 313
22.4.1 Abstraction Definitions.......................................................................................... 313
22.4.2 Data File Construction........................................................................................... 314
22.4.3 Data Cleaning: Removal of Outliers.................................................................... 314
22.5 Data Analysis Methods...................................................................................................... 315
22.6 Analysis Results.................................................................................................................. 316
22.6.1 Page View Abstraction........................................................................................... 316
22.6.1.1 Sample Findings....................................................................................... 316
22.6.2 Session Abstraction................................................................................................ 316
22.6.2.1 Sample Findings....................................................................................... 316
22.6.3 Task Abstraction...................................................................................................... 317
22.6.3.1 Sample Findings....................................................................................... 317
22.6.4 Activity Abstraction............................................................................................... 319
22.6.4.1 Sample Findings....................................................................................... 319
22.7 Discussion............................................................................................................................ 320
22.8 Conclusions.......................................................................................................................... 322
References...................................................................................................................................... 322
22.1 Introduction
Web-based learning environments in the form of a course Web site or a learning manage-
ment system are used extensively in tertiary education. When providing such an environ-
ment, an educator will generally have an expectation of how it will be used by students; however, this often does not match actual usage. Students may access the learning envi-
ronment at different rates or for different times, or for purposes other than those intended
by their educator. A Web-based learning environment is typically complex and there are
various determinants for usage, which may relate to the technology, the teaching program,
or the student. Information about how students use such an environment is important for
effective design of an education program, but difficult to gain using traditional evalua-
tion methods. The aim of this study was to investigate the usage of a Web-based learning
environment from analysis of student interactions with the environment. The Web-based
learning environment explored was a Web site developed to support and monitor students
working on a capstone project.
The questions that guided this investigation were
• How frequently and when did students access the Web site?
• What resources did students use?
• What time did students spend at the Web site?
• What were the patterns of use over the year?
• Were there any differences in usage based on gender or on course performance?
This case study will serve to illustrate data collection and preparation techniques, and the
type of information that can be gained from statistical analysis of the data gathered using
the techniques described in Chapter 3.
3. Task—a sequence of interactions of a user within one resource, for example, the
Time Tracker or File Manager. The elapsed time for a task was calculated by mea-
suring the time interval from the first interaction within a resource until the last
interaction with that resource or the first interaction with another resource within
the same session.
4. Activity—a series of one or more interactions with the Web site to achieve a partic-
ular outcome. In the context of the WIER Web site, examples of learning activities
were as follows: recording a task time in the Time Tracker, posting a newsgroup
item, uploading a file to the File Manager or navigating to another resource. Not
all activities were taken to completion and in many sessions there were a mixture
of completed and partial activities.
The specification of the activities was a complex process involving the Web site developer
and the capstone project coordinator. A software tool was written specifically for this pro-
cess [3]. There were over 144 ways in which students could interact with WIER, equivalent
to the number of pages on the Web site. Each possible student interaction with the Web
site was examined to determine its purpose and then it was defined as an activity or a
component of an activity. Using this method, every interaction was identified as part of
one or more activities and a series of templates of activity abstractions was prepared. As
a verification of these templates, all possible student activities on WIER were performed
and each activity definition matched to the log file scripts generated. The templates may be
seen as forming the educator model of the Web site [4].
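Although the chapter relies on a purpose-built tool for the activity templates, the lower-level session and task abstractions essentially come down to splitting a student's time-ordered events on inactivity gaps and on resource changes. The sketch below illustrates this with hypothetical field names and an assumed 30-minute session timeout; the actual limits used in the study are not restated here.

```python
from typing import List, Tuple

# hypothetical log record: (student_id, timestamp_in_minutes, resource, page)
Log = Tuple[str, float, str, str]

def split_sessions(events: List[Log], timeout_min: float = 30.0):
    """Split one student's time-ordered events into sessions by inactivity gap."""
    sessions, current = [], []
    for ev in events:
        if current and ev[1] - current[-1][1] > timeout_min:
            sessions.append(current)
            current = []
        current.append(ev)
    if current:
        sessions.append(current)
    return sessions

def split_tasks(session: List[Log]):
    """Within a session, a task is a maximal run of events on one resource."""
    tasks, current = [], []
    for ev in session:
        if current and ev[2] != current[-1][2]:
            tasks.append(current)
            current = []
        current.append(ev)
    if current:
        tasks.append(current)
    return tasks

events = [("s1", 0, "FileManager", "list"), ("s1", 2, "FileManager", "upload"),
          ("s1", 5, "TimeTracker", "log"), ("s1", 90, "FileManager", "list")]
for sess in split_sessions(events):
    print([len(task) for task in split_tasks(sess)])
```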
or performing all activities. A refinement of this technique could vary the interaction interval limits depending on the type of resource or activity, as discussed in Redpath and Sheard [6].
• Counts and frequencies at each level of abstraction: The frequencies of page views,
sessions, tasks, and activities were used to give measures of Web site usage and
were described using totals, percentages, and medians. The distributions of these
frequencies over line periods and student groups showed high skewness and
kurtosis, and outlying data. These indicated that medians rather than means
were more appropriate measures of central tendency for these distributions.
• Comparisons of counts and frequencies across time periods and student groups: The
frequencies of page views, sessions, tasks, and activities were compared using
Mann–Whitney U and Kruskal–Wallis tests. The nonconformance with normality
of these distributions meant that nonparametric statistical tests were more appro-
priate than parametric tests.
• Length of time of sessions, tasks, and activity abstractions: The total time for all sessions
was calculated for each student. This was used to give a measure of the mean
times students spent using the Web site each week. The session, task, and activ-
ity times were used to give measures of Web site usage and were described using
totals, percentages, and medians. The distribution of these times showed high
positive skewness and kurtosis, indicating that medians were more appropriate
descriptive statistics than means.
• Comparisons of session, task, and activity times: The session, task, and activity times
between groups of students were compared using Mann–Whitney U and Kruskal–Wallis tests. The nonconformance with normality meant that nonparametric sta-
tistical tests were more appropriate than parametric tests.
• Counts of interactions within activity sequences: The lengths of activity sequences
were used to give a measure of activity performance efficiency. The lengths were
compared using Mann–Whitney U and Kruskal–Wallis tests. Spearman's rank-
order correlation coefficients were used to search for relationships.
• Percentages of completed activities: Chi-square tests were used to test for compari-
sons of complete and partially complete activities.
• Comparisons of variances of frequency distributions: The variances of the frequency
of the performance of activities between different semesters of the project were
compared using Levene’s homogeneity of variance test. This test does not assume
conformance to normality.
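The comparisons listed above map directly onto standard SciPy routines. A minimal sketch with made-up per-student frequencies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# hypothetical per-student File Manager access counts
high = rng.poisson(80, size=40)          # high-achieving students
low = rng.poisson(53, size=45)           # low-achieving students
mid = rng.poisson(65, size=50)           # a third group, for the k-sample test
minutes = rng.gamma(4.0, 40.0, size=40)  # total minutes on the site (same 40 students)

# two groups: Mann-Whitney U test
u, p_u = stats.mannwhitneyu(high, low)
# three or more groups: Kruskal-Wallis test
h, p_h = stats.kruskal(high, mid, low)
# monotonic relationship between two measures: Spearman's rank-order correlation
rho, p_rho = stats.spearmanr(high, minutes)
# equality of variances without assuming normality: Levene's test
w, p_w = stats.levene(high, low)

print(f"Mann-Whitney p={p_u:.3f}, Kruskal-Wallis p={p_h:.3f}, "
      f"Spearman rho={rho:.2f} (p={p_rho:.3f}), Levene p={p_w:.3f}")
```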
FIGURE 22.1
Frequency of sessions on WIER per week of the project.
FIGURE 22.2
Frequency of sessions on WIER per hour of the day.
FIGURE 22.3
Frequency of sessions on WIER per day of the week.
FIGURE 22.4
Frequency of access to resources (File Manager, New Time Tracker, Time Graph, Old Time Tracker, Group Forum, Group Details, Documents, Risk List, Event Scheduler, Group News, Edit Details, Resource, Final Submission, Past Projects, View Projects, Final Tasks, Function Point, Team Roles, Timelines, Resources, Introduction, Select Team, Class Discussion, Help).
low proportions of time on each of the other resources. There were far fewer accesses to
and time spent on the communication resources (Group Forum, Group News, and Class
Discussion). Further analysis showed that the File Manager was the last resource accessed
in more than half the sessions, providing an indication that File Manager tasks were the
main purpose of these visits [9].
The median number of accesses to the File Manager for the high-achieving students was 80 compared with 53 for the low-achieving students, and a Mann–Whitney U test indicated this difference was significant (U = 1389, p < 0.05). Furthermore, the high-achieving students spent more time than the low-achieving students using the File Manager. The median total time spent on the File Manager for the high-achieving students was 254 min and 18 s compared with 131 min and 19 s for the low-achieving students, and a Mann–Whitney U test indicated that this difference was significant (U = 1430, p < 0.05).
There was no difference in frequency of access to the File Manager based on gender; however, the female students spent more time using the File Manager than the male students. The median total time spent on the File Manager for the female students was 195 min and 51 s compared with 156 min and 21 s for the male students, and a Mann–Whitney U test indicated this difference was significant (U = 5753, p < 0.01). A similar result was found with the use of the New Time Tracker. For this resource, the median total time spent for the female students was 316 min and 46 s compared with 245 min and 23 s for the male students, and a Mann–Whitney U test indicated this difference was significant (U = 5722, p < 0.05).
Long activity sequences could indicate that students were using a resource in an inefficient way and perhaps having difficulty performing the activity. For a log-time activity, the minimum sequence length was three interactions. Cross tabulations with a chi-square test indicated that low-achieving students recorded more sequences of length greater than three than high-achieving students, χ²(1, N = 1531) = 8.18, p < 0.01. This suggested that the high-achieving students used the resource more efficiently and appeared to have less difficulty in performing tasks. There was no difference based on gender.
The analyses of the frequency of activities showed that students became more regular in their work habits over the course of the IE Project. For example, a comparison of the frequency of download-file and upload-file activities on a weekly basis indicated that more consistent access to files occurred in the second semester than in the first semester (see Figures 22.5 and 22.6). The variations in frequencies of access were compared using Levene's test of the homogeneity of variance, which indicated a difference in the variances between the two semesters that was significant at p < 0.01.
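The chi-square comparison reported above is a standard 2 × 2 contingency test. A sketch with hypothetical counts of minimum-length (three-interaction) versus longer sequences by achievement group:

```python
from scipy.stats import chi2_contingency

# rows: high / low achievers; columns: sequences of length 3 / length > 3
# the cell counts are hypothetical and only chosen to sum to the reported N = 1531
table = [[520, 210],
         [430, 371]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```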
22.7 Discussion
The usage of the WIER Web site and resources in terms of frequency of access and time
spent at the site provided insights into student learning behavior. The daily, weekly, and
whole project patterns of access to WIER indicate that it supported a variety of work styles.
Students became more regular and efficient in the performance of activities over the course
of the project.
The level of the use of resources and performance of activities show that students
were very functional in their use of this learning environment. Huge differences were
found in the level of engagement within each resource. The students engaged most with
facilities that were mandated and assessed. A notable exception was the File Manager,
FIGURE 22.5
Frequency of file uploads and downloads per week of the first semester.
FIGURE 22.6
Frequency of file uploads and downloads per week of the second semester.
which students were not required to use, but apparently found useful. The least-used
resources were those provided for communication and collaboration. Students had a number of ways to communicate and work together on project tasks, and most did not
choose to do this through WIER. This highlighted a mismatch between how the educa-
tors intended WIER to be used and how the students wanted to work and suggests that
educator participation and guidance may be necessary to encourage students to engage
in this way.
Comparisons of usage based on student performance showed a greater use of WIER by
the high-achieving students, giving indications of the value of WIER for their work. As an
example, the greater engagement with the Time Tracker activities by the high-achieving
students appears to indicate a more thorough and explicit analysis and recording of the
details of their project work. In contrast, the higher use of the download file activity in the
Document Templates by the weaker students indicates that they may have been seeking
extra guidance in the form of examples to follow. This type of information is potentially
useful to educators. Monitoring student use of particular Web site resources can be used to
determine students who are having difficulty with their work or are at risk of not achiev-
ing a successful outcome.
The analysis of activities based on gender gave further insights into diversity in behavior.
The female students spent more time using the Time Tracker and File Manager, indicat-
ing they may take on more administrative roles in the project groups. This was supported
by the analysis of the activities in these resources. From an educational perspective, such
allocations of project roles have been found to be inappropriate by the capstone project
coordinators and this particular finding encouraged the coordinators to review the role-
allocation processes within the groups.
Overall, the results indicated that the capstone project coordinators’ expectations were
not necessarily in line with the ways students wanted to learn. Students will use and adapt
technology to suit their own needs. This highlights the value in understanding student
work styles and preferences in order to provide an environment that is useful and valuable
to them.
22.8 Conclusions
The analysis of interactions data at page view, session, task, and activity abstractions, as
demonstrated by this study of a courseware Web site, extends the type of usage analysis
typically conducted on student interactions data. This type of analysis can lead to deeper
insights into student learning behaviors, providing valuable information that may be used
by educators and Web site designers when making decisions about their teaching pro-
grams, learning environment, and resources. The ultimate goal is to create useful and
usable learning environments for students.
References
1. Hagan, D., Tucker, S., and Ceddia, J. 1999. Industrial experience projects: A balance of process
and product. Computer Science Education, 9(3): 106–113.
2. Ceddia, J., Tucker, S., Clemence, C., and Cambrell, A. 2001. WIER—Implementing artifact
reuse in an educational environment with real projects. In Proceedings of 31st Annual Frontiers in
Education Conference, Reno, NV, pp. 1–6.
3. Ceddia, J., Sheard, J., and Tibbey, G. 2007. WAT: A tool for classifying learning activities from a
log file. In Proceedings of Ninth Australasian Computing Education conference (ACE2007), Ballarat,
Australia, Australian Computer Society, pp. 11–17.
4. Ceddia, J. and Sheard, J. 2005. Log files for educational applications. In Proceedings of World
Conference on Educational Multimedia, Hypermedia and Telecommunications (ED-MEDIA 2005),
Montreal, Canada, Association for the Advancement of Computing in Education (AACE):
Norfolk, VA, pp. 4566–4573.
5. Sheard, J., Ceddia, J., Hurst, J., and Tuovinen, J. 2003. Determining Website usage time from
interactions: Data preparation and analysis. Journal of Educational Technology Systems, 32(1):
101–121.
6. Redpath, R. and Sheard, J. 2005. Domain knowledge to support understanding and treatment
of outliers. In Proceedings of International Conference on Information and Automation (ICIA 2005),
Colombo, Sri Lanka, pp. 398–403.
7. Burns, R. 2000. Introduction to Research Methods, 4th ed. 2000, London, U.K.: Sage Publications
Ltd.
8. Sheard, J., Ceddia, J., Hurst, J., and Tuovinen, J. 2003. Inferring student learning behaviour from
Website interactions: A usage analysis. Journal of Education and Information Technologies, 8(3):
245–266.
9. Buttenfield, B.P. and Reitsma, R.F. 2002. Loglinear and multidimensional scaling models of
digital library navigation. International Journal of Human-Computer Studies, 57: 101–119.
23
Bayesian Networks and Linear Regression Models
of Students’ Goals, Moods, and Emotions
Contents
23.1 Introduction......................................................................................................................... 323
23.2 Predicting Goals and Attitudes........................................................................................ 324
23.2.1 Data Description..................................................................................................... 324
23.2.2 Identifying Dependencies among Variables....................................................... 325
23.2.3 An Integrated Model of Behavior, Attitude, and Perceptions.......................... 326
23.2.4 Model Accuracy...................................................................................................... 329
23.2.5 Case Study Summary............................................................................................. 329
23.3 Predicting Emotions........................................................................................................... 330
23.3.1 Background and Related Work............................................................................. 330
23.3.2 Data Description..................................................................................................... 331
23.3.3 Overall Results........................................................................................................ 332
23.3.4 Students Express Their Emotions Physically..................................................... 332
23.4 Summary and Future Work.............................................................................................. 335
Acknowledgments....................................................................................................................... 336
References...................................................................................................................................... 337
23.1 Introduction
If computers are to interact naturally with humans, they should recognize students’ affect
and express social competencies. Research has shown that learning is enhanced when
empathy or support is provided and that improved personal relationships between teachers and students lead to increased student motivation.1–4 Therefore, if tutoring systems can embed affective support for students, they should be more effective. However, previous research has tended to privilege the cognitive over the affective and to view learning as information processing, marginalizing or ignoring affect.5 This chapter describes two
data-driven approaches toward the automatic prediction of affective variables by creating
models from students’ past behavior (log-data). The first case study shows the methodology
and accuracy of an empirical model that helps predict students’ general attitudes, goals,
and perceptions of the software, and the second develops empirical models for predicting
students’ fluctuating emotions while using the system. The vision is to use these models to
predict students’ learning and positive attitudes in real time. Special emphasis is placed in
this chapter on understanding and inspecting these models, to understand how students
express their emotions, attitudes, goals, and perceptions while using a tutoring system.
TABLE 23.1
Post-Tutor Student Goals, Attitudes, and Perceptions
Goals/Attitudes While Learning
Seriously try learn. How seriously did you try to learn from the tutoring system?
Get it over with (fast). I just wanted to get the session over with, so I went as fast as possible without paying
much attention.
Challenge. I wanted to challenge myself: I wanted to see how many I could get right, asking as little help as
possible.
No care help. I wanted to get the correct answer, but didn’t care about the help or about learning with the
software.
Help fading attitude. I wanted to ask for help when necessary, but tried to become independent of help as time
went by.
Other approaches. I wanted to see other approaches to solving the problem, and thus asked for help even if I
got it right.
Fear of wrong. I didn’t want to enter a wrong answer, so I asked for help before attempting an answer, even if I
had a clear idea of what the answer could be.
Student perceptions of the tutor.
Learned? Do you think you learned how to tackle SAT-Math problems by using the system?
Liked? How much did you like the system?
Helpful? What did you think about the help in the system?
Return? Would you use the system again if there were more problems and help for you to see? How many more
times would you use it again?
Interaction with the tutor.
Audio? How much did you use the audio for the explanations?
is not correlated to “average hints seen per problem,” but it is correlated to “average hints
seen in helped problems.” The trend suggests that students who search deeply for help are
more likely to learn. In addition, learning gain is not significantly correlated with “time
spent in a problem,” but instead to “time spent in problems in which help was seen.” This
suggests that spending much time struggling in a problem and not seeing help will not
lead to learning; instead, a student should spend significant time seeing help. Learning is
negatively correlated to average incorrect attempts per problem, suggesting that students
who tend to make many incorrect responses per problem will not improve much from pre-
to posttest. Many of these correlations are not very strong (in general, none of them by
itself accounts for more than 15% of the variance). However, a model that accounts
for all these variables together should allow for a better prediction of the dependent vari-
ables (i.e., goals, attitudes, perceptions of the software and learning).
the system,” though not directly associated with higher learning gains. It is also correlated
to the “challenge” attitude, showing that students might want to make an attempt even if
they risk a wrong answer. One interesting dependency is that a high number of mistakes
per problem is correlated to a higher chance of a student saying he or she wants to “get
over with” (probably just “gaming” and clicking through to get the answer). However,
making a high number of mistakes in problems where they also request help is associated
with a lower likelihood of wanting to “get over with” the session, again suggesting
that failing and asking for help is associated with a positive attitude toward learning. The
positive perceptions of the software, such as willingness to return to use the system,
are correlated to productive behaviors that lead to higher learning gains (e.g., “average
hints per problem”). Students who decide to seek hints seem to be genuinely trying
to learn.
[Figure 23.1 graphic: correlation network whose nodes include Avg. hints per problem, Help fading attitude, % Helped problems: help sought before an attempt, Challenge attitude, Other approaches, Return?, Fear of wrong, Pretest incorrect, % improvement, Improved?, Audio?, Pretest correct, Avg. incorrect, and Avg. seconds in helped problems.]
FIGURE 23.1
Part of the full network of correlations between latent and observed variables. Variables that describe a stu-
dent’s observed interaction style (light-colored nodes) are correlated with the students’ latent attitudes, feelings,
and learning (dark nodes) derived from the survey. Line weight indicates correlation: dashed line (- -) indicates
a negative correlation; lines (___) indicate a positive correlation; thick lines indicate p < 0.01; light lines indicate
correlations of p < 0.05.
Next, the parameters of the network were generated by (1) discretizing all variables into
two levels (high/low) with a median split; (2) simplifying the model further by discard-
ing existing links whose connecting nodes do not pass a Chi-square test (the dependency
is not maintained after making the variables discrete); (3) creating conditional probabil-
ity tables (CPTs) from the cross-tabulations of the students’ data (“maximum likelihood”
method for parameter learning in discrete models).12
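The three steps above can be sketched in a few lines of pandas/SciPy code. This is a minimal illustration under assumed column names (e.g., "avg_hints", "fear_of_wrong", "time_between_attempts"), not the authors' actual implementation.

import pandas as pd
from scipy.stats import chi2_contingency

def median_split(df):
    # (1) Discretize every variable into high/low around its median.
    return df.apply(lambda col: (col > col.median()).map({True: "high", False: "low"}))

def keep_link(disc, a, b, alpha=0.05):
    # (2) Keep a link only if the discretized variables are still dependent
    # according to a chi-square test on their cross-tabulation.
    table = pd.crosstab(disc[a], disc[b])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

def learn_cpt(disc, child, parents):
    # (3) Maximum-likelihood CPT: relative frequencies of the child value
    # within each combination of parent values.
    return pd.crosstab([disc[p] for p in parents], disc[child], normalize="index")

# Hypothetical usage with survey/log columns:
# disc = median_split(students_df)
# if keep_link(disc, "avg_hints", "fear_of_wrong"):
#     cpt = learn_cpt(disc, "time_between_attempts", ["fear_of_wrong", "challenge"])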
The probability that a student has a goal/attitude, given that we know his or her observable
actions, is stated as a conditional probability; dependencies in the network are defined by conditional
probabilities, with one entry for each different combination of values that the variables can
jointly take.11 This is represented as a table that lists the probability of each value the child node
can take for every combination of values of its parents (see Table 23.2).
Assume that a BBN represents whether students have a fear of getting the problem
wrong (Figure 23.2, middle row). Consider that the tutor begins with no clear knowledge
[Figure 23.2 graphic: nodes include the survey-derived variables Nocarehelp, Return?, Serious?, Challenge, Liked?, Helpful?, GetOverWith, Learned?, FearWrongAnswer, and Audio?, and the observed leaf variables AvgHints_HelpedProbs, SecsPerProb, AvgHintsPerProblem, AvgAttempts_HelpedProbs, Gender, TimeBetweenAttempts, and %HelpedProbs.]
FIGURE 23.2
Part of the structure of a Bayesian network to infer attitudes, perceptions, and learning (light gray nodes). The
bottom (leaf) nodes are set as evidence.
TABLE 23.2
Learning Parameters to the BBN

  "Fear of Wrong"   "Challenge"   Time between Attempts   Cases   Probability
  False             False         Low                        43   0.64 (1)
  False             False         High                       24   0.36 (2)
  False             True          Low                        35   0.42 (3)
  False             True          High                       48   0.58 (4)
  True              False         Low                         8   0.50 (5)
  True              False         High                        8   0.50 (6)
  True              True          Low                         7   0.32 (7)
  True              True          High                       15   0.68 (8)

Note: Maximum likelihood is used to learn the conditional probability table for the "fear of
wrong" attitude from students' data.
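As a worked illustration of how such a CPT supports inference (anticipating the querying mechanism described in the following paragraphs), the sketch below computes the posterior belief in "Fear of wrong" by enumeration, using the values in Table 23.2 and the 50% prior mentioned in the text; the uniform prior on "Challenge" is an assumption added only for illustration.

# Minimal sketch (not the chapter's implementation): posterior inference by
# enumeration over the tiny fragment in Table 23.2. Uniform 0.5 priors on both
# attitudes are assumed for illustration only.

# P(TimeBetweenAttempts = "high" | Fear, Challenge), read off Table 23.2.
p_high = {
    (False, False): 0.36,
    (False, True):  0.58,
    (True,  False): 0.50,
    (True,  True):  0.68,
}

def posterior_fear(time_high: bool) -> float:
    """P(Fear = True | observed TimeBetweenAttempts)."""
    def joint(fear: bool) -> float:
        total = 0.0
        for challenge in (False, True):
            p_time = p_high[(fear, challenge)] if time_high else 1 - p_high[(fear, challenge)]
            total += 0.5 * 0.5 * p_time   # prior(fear) * prior(challenge) * likelihood
        return total
    return joint(True) / (joint(True) + joint(False))

print(posterior_fear(time_high=True))   # ~0.56: belief in "fear of wrong" rises slightly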
about whether students will express this fear in the survey: there is a 50% probability that
the student will state this fear or not. All nodes in this case study are binary; that is,
nodes have two possible values denoted by T (true) and F (false). Either students will
express this attitude in the surveys or they will not. The strength of the relationship
between two nodes is shown in Table 23.2. When a hidden node, such as “Fear of wrong,” is
queried, its probability distribution is updated to incorporate all the leaf nodes in Figure
23.2. Two propositions, P(fear of wrong) and P(time between attempts), are dependent if a
change in belief about one affects belief in the other. In general, if we are interested in the
TABLE 23.3
Accuracy of Predictions, 10-Fold Cross-Validation

  Attribute                               Accuracy   % Times Highly Certain Predictions
                                                     (P(T) > 0.7 or P(T) < 0.3)
  Get over with? (Attitude)                 0.89     96%
  Liked? (Perception)                       0.82     80%
  Learned? (Perception)                     0.81     97%
  Fear of wrong (Attitude)                  0.81     83%
  No care help? (Attitude)                  0.76     92%
  Help fading attitude (Attitude)           0.76     41%
  Other approaches (Attitude)               0.75     59%
  Gain pre-post test (Cognitive outcome)    0.72     37%
  Challenge attitude                        0.70     28%
  Improved? (Cognitive outcome)             0.69     57%
  Return? (Perception)                      0.65     34%
  Audio? (Cognitive outcome)                0.58     57%
  Seriousness? (Attitude)                   0.54     11%
experience: their goals, attitudes, and whether they learn. We showed how machine learning
methods and classical statistical analysis were combined into a methodology that creates
a fairly accurate model of students’ latent variables. This model can be used in
real time so that the tutoring software can make inferences about student emotion by
keeping “running averages” of behavioral variables (e.g., average hints per problem). This
provides the tutor with an estimation of students’ attitudes and likely outcomes while
students interact with the program. It is interesting that many of the students’ negative
attitudes and lack of learning were expressed through different forms of “speeding” within the
software (consistent with past research11,6). Corrective pedagogical decisions can be made
by the tutoring software to change the standard course of action whenever attitudes are
inferred to be negative, and the teacher can be informed via web-based report tools that
are continuously updated.
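A minimal sketch of the "running averages" idea, with hypothetical attribute names; it simply maintains incremental means of the behavioral variables as the student works, so the trained model can be queried at any moment.

class RunningAverages:
    """Keeps incremental means of behavioral variables (e.g., hints per problem)."""

    def __init__(self):
        self.problems = 0
        self.total_hints = 0
        self.total_seconds = 0.0

    def record_problem(self, hints_seen: int, seconds_spent: float) -> None:
        self.problems += 1
        self.total_hints += hints_seen
        self.total_seconds += seconds_spent

    def features(self) -> dict:
        n = max(self.problems, 1)
        return {
            "avg_hints_per_problem": self.total_hints / n,
            "avg_seconds_per_problem": self.total_seconds / n,
        }

# avgs = RunningAverages()
# avgs.record_problem(hints_seen=2, seconds_spent=95.0)
# evidence = avgs.features()   # fed to the Bayesian network as leaf evidence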
FIGURE 23.3
Sensors used in the classroom (clockwise): facial expression sensor, conductance bracelet, pressure mouse, and
posture analysis seat.
TABLE 23.4
Cognitive-Affective Terms Based on Human Face Studies

  Cognitive-Affective Term                Emotion Scale                                   Ekman's Categorization
  High enjoyment … little enjoyment       "I am enjoying this." … "This is not fun."      Joy
  High frustration … little frustration   "I am very frustrated." … "I am not             Anger
                                          frustrated at all."
  Interest/novelty … boredom/dullness     "I am very interested." … "I am bored."         Interest and surprise
  Anxiety … confidence                    "I feel anxious" … "I feel very confident"      Fear

Sources: Ekman, P., Universals and cultural differences in facial expressions of emotion,
in J. Cole (Ed.), Nebraska Symposium on Motivation 1971, Vol. 19, pp. 207–283, University
of Nebraska Press, Lincoln, NE, 1972; Ekman, P., Facial Expressions, John Wiley & Sons
Ltd., New York, 1999.
value.21 Posttest surveys also included questions that measured student perceptions of the
software. Every 5 min, provided the student had finished a mathematics problem, a screen
queried their emotion: “How [interested/excited/confident/frustrated] do you feel right
now?” Students chose one of five possible emotion levels, where the ends were labeled
(e.g., I feel anxious… very confident). The emotion queried was randomly selected (obtain-
ing at least one report per student per emotion for most subjects).
B. Predicting emotions from physiological activity and tutor variables, for the last problem seen
[Figure 23.4(b) graphic: tutor variables include seconds to 1st attempt, seconds to solve, # hints seen, solved at 1st attempt?, # incorrect attempts, time in tutor, gender, and ped. agent; sensor variables include the camera facial-detection "concentrating" maximum probability and seat-sensor SitForward statistics (mean, min, max, stdev during the time spent on the last problem seen).]
FIGURE 23.4
Variables that help predict self-report of emotions. The result suggests that emotion depends on the context in
which the emotion occurs (math problem just solved), and also can be predicted from physiological activity
captured by the sensors (bottom row).
TABLE 23.5
Each Cell Corresponds to a Linear Model to Predict Emotion Self-Reports

  Columns (feature sets): Tutor Context Only; Camera + Tutor; Seat + Tutor; Wrist + Tutor;
  Mouse + Tutor; All Sensors + Tutor.
  Confident    R = 0.49, N = 62; R = 0.72, N = 20; R = 0.35, N = 32; R = 0.55, N = 28; R = 0.82, N = 17
  Frustrated   R = 0.53, N = 69; R = 0.63, N = 25; R = 0.68, N = 25; R = 0.56, N = 45; R = 0.54, N = 44; R = 0.72, N = 37
  Excited      R = 0.43, N = 66; R = 0.83, N = 21; R = 0.65, N = 39; R = 0.42, N = 37; R = 0.57, N = 37; R = 0.70, N = 15
  Interested   R = 0.37, N = 94; R = 0.54, N = 36; R = 0.28, N = 51; R = 0.33, N = 51
Note: Models were generated using stepwise regression; the variables entered into each model are shown in
Table 23.6. The columns correspond to the available feature sets, and each row lists the emotional self-report
being predicted. R values correspond to the fit of the linear regression model (best-fit models for each emotion
are in bold). Empty cells mean that no fit model was found for that data set. N values vary because students may
be missing data for a sensor: each case corresponds to one emotion report crossed with the data for each sensor
(mean, minimum, and maximum value for the last problem before the report), and full data for all sensors is
limited to a subset of students.
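The note above mentions stepwise regression. The following forward-selection sketch illustrates the general idea under assumed feature/target arrays; it is not the authors' exact procedure (which most likely relied on a standard statistics package), and the stopping criterion (min_gain) is a hypothetical choice.

import numpy as np

def fit_r(X, y):
    # Least-squares fit; returns the multiple correlation R between y and the fitted values.
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    yhat = X1 @ beta
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return np.sqrt(max(0.0, 1 - ss_res / ss_tot))

def forward_stepwise(X, y, names, min_gain=0.02):
    # Greedily add the feature that most improves R until the gain is negligible.
    selected, best_r = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        gains = [(fit_r(X[:, selected + [j]], y), j) for j in remaining]
        r, j = max(gains)
        if r - best_r < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r = r
    return [names[j] for j in selected], best_r

# Hypothetical usage: X holds tutor-context and sensor features for the last
# problem before each self-report, y holds the reported emotion level (1-5).
# chosen, r = forward_stepwise(X, y, feature_names)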
for simultaneous use in classrooms in Massachusetts and Arizona. The four sensors, shown
in Figure 23.3, include a facial expression recognition system that incorporates a computa-
tional framework to infer a user’s state of mind,25 a wireless conductance bracelet based on
an earlier skin-conductance-sensing glove developed at the MIT Media Lab, a pressure
mouse to detect the increasing amounts of pressure that students place on mice related to
[Figure 23.5 graphic: panels plot the probability (0.0–0.6) of "concentrating" or "thinking" against time from event (s), with hint (?) and sit-forward (F) markers.]
FIGURE 23.5
MindReader25 Camera Software output stream (probabilities of concentrating or thinking) for students report-
ing different confidence levels and frustration levels. The graphs show minutes of student activity before the
self-report of high or low confidence/frustration. Note that students who have low confidence are “concentrating”
more than highly confident ones. Students who are not frustrated are thinking frequently. Each contiguous line
represents a single student episode, and the zero point on the X-axis represents the moment of the report of
confidence or frustration. The small letters (O, X, ?, F) indicate actions taken by the student (correct, incorrect,
hint, or sit forward).
streams from facial detection software, can help tutors predict more than 60% of the vari-
ance of some student emotions, which is better than when these sensors are absent.
Future work consists of validating these models with new populations of students and
verifying that the loss of accuracy is relatively small. The final goal is to dynamically pre-
dict emotional states, goals, and attitudes of new students from these models created from
previous students. We are working on pedagogical strategies to help students cope with
states of negative emotion and support their return to on-task behavior19, as well as teacher
reports. Further down the line, we intend to create tutor modules that recompute these affec-
tive models as new student data arrives, thus producing self-improving tutoring software.
Acknowledgments
We acknowledge contributions to the sensor software development from Rana el
Kaliouby, Ashish Kapoor, Selene Mota, and Carson Reynolds. We also thank Joshua
Richman, Roopesh Konda, and Assegid Kidane at ASU for their work on sensor
manufacturing.
This research was funded by two awards: one from the National Science Foundation,
#0705554, IIS/HCC Affective Learning Companions: Modeling and Supporting Emotion during
Teaching, Woolf and Burleson (PIs) with Arroyo, Barto, and Fisher; and a second award
from the U.S. Department of Education to Woolf, B. P. (PI) with Arroyo, Maloy and the
Center for Applied Special Technology (CAST), Teaching Every Student: Using Intelligent
Tutoring and Universal Design to Customize the Mathematics Curriculum. Any opinions, find-
ings, conclusions, or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of the funding agencies.
References
1. Graham, S. and Weiner, B. (1996). Theories and principles of motivation. In Berliner, D. and
Calfee, R. (Eds.), Handbook of Educational Psychology. New York: Macmillan, pp. 63–84.
2. Zimmerman, B.J. (2000). Self-efficacy: An essential motive to learn. Contemporary Educational
Psychology, 25, 82–91.
3. Wentzel, K. and Asher, S.R. (1995). Academic lives of neglected, rejected, popular, and contro-
versial children. Child Development, 66, 754–763.
4. Mueller, C.M. and Dweck, C.S. (1998). Praise for intelligence can undermine children’s motivation
and performance. Journal of Personality and Social Psychology, 75(1), 33–52.
5. Picard, R.W., Papert, S., Bender, W., Blumberg, B., Breazeal, C., Cavallo, D., Machover, T.,
Resnick, M., Roy, D., and Strohecker, C. (2004). Affective Learning—A Manifesto. BT Technology
Journal, 22(4), 253–269.
6. Aleven, V., McLaren, B., Roll, I., and Koedinger, K. (2004). Toward tutoring help seeking:
Applying cognitive modeling to meta-cognitive skills. In Proceedings of the 7th International
Conference on Intelligent Tutoring Systems (ITS-2004). Berlin, Germany: Springer.
7. Zhou, X. and Conati, C. (2003). Inferring user goals from personality and behavior in a causal
model of user affect. In Proceedings of the International Conference on Intelligent User Interfaces,
Miami, FL, pp. 211–218.
8. Baker, R., Corbett, A.T. and Koedinger, K.R. (2001). Toward a model of learning data represen-
tations. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, Edinburgh,
U.K., August 1–4, 2001, pp. 45–50.
9. Arroyo, I., Beal, C.R., Murray, T., Walles, R., and Woolf, B.P. (2004). Web-based intelligent mul-
timedia tutoring for high stakes achievement tests. In J.C. Lester, R.M. Vicari, and F. Paraguaçu
(Eds.), Intelligent Tutoring Systems, 7th International Conference, ITS 2004. Maceiò, Brazil, pp.
468–477, Proceedings. Lecture Notes in Computer Science 3220. Berlin, Germany: Springer.
10. Arroyo, I., Ferguson, K., Johns, J., Dragon, T., Mehranian, H., Fisher, D., Barto, A., Mahadevan, S.,
and Woolf, B. (2007). Repairing disengagement with non invasive interventions. In International
Conference on Artificial Intelligence in Education, Marina del Rey, CA, July 09, 2007.
11. Woolf, B. (2009). Building Intelligent Tutors: Bridging Theory and Practice. San Francisco, CA:
Elsevier Inc./Morgan Kauffman.
12. Russell, S. and Norvig, P. (2002). Probabilistic reasoning systems, Chapter 14. In Artificial
Intelligence: A Modern Approach, 2nd Edn. Upper Saddle River, NJ: Prentice Hall.
13. Conati, C. and Maclaren, H. (2004). Evaluating a probabilistic model of student affect. In
Proceedings of ITS 2004, 7th International Conference on Intelligent Tutoring Systems, Lecture
Notes in Computer Science, Volume 3220/2004, Berlin/Heidelberg, Germany: Springer,
pp. 55–66.
14. D’Mello, S. and Graesser, A. (2007). Mind and body: Dialogue and posture for affect detec-
tion in learning environments. In Frontiers in Artificial Intelligence and Applications, Volume 158.
Amsterdam, the Netherlands: IOS Press.
15. Graesser, A.C., Chipman, P., King, B., McDaniel, B., and D’Mello, S. (2007). Emotions and
Learning with AutoTutor. In 13th International Conference on Artificial Intelligence in Education
(AIED 2007). R. Luckin, K. Koedinger, and J. Greer (Eds.). Amsterdam, the Netherlands: IOS
Press, pp. 569–571.
16. D’Mello, S.K., Picard, R.W., and Graesser, A.C. (2007). Towards an affect-sensitive AutoTutor.
Special issue on Intelligent Educational Systems IEEE Intelligent Systems, 22(4), 53–61.
17. Ekman, P. (1999). Facial Expressions. New York: John Wiley & Sons Ltd.
18. Burleson, W. (2006). Affective learning companions: Strategies for empathetic agents with real-
time multimodal affective sensing to foster meta-cognitive approaches to learning, motivation,
and perseverance. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA.
19. Arroyo, I., Cooper, D., Burleson, W., Woolf, B.P., Muldner, K., and Christopherson, R. (2009).
Emotion sensors go to school. In Proceedings of the 14th International Conference on Artificial
Intelligence in Education: Building Learning Systems that Care: From Knowledge Representation to
Affective Modelling. Amsterdam, the Netherlands: IOS Press, pp. 17–24.
20. Arroyo, I., Muldner, K., Burleson, W., Woolf, B.P., and Cooper, D. (2009). Designing affective
support to foster learning, motivation and attribution. Workshop on Closing the Affective Loop
in Intelligent Learning Environments. In 14th International Conference on Artificial Intelligence in
Education, Brighton, U.K., July 6–10, 2009.
21. Wigfield, A. and Karpathian, M. (1991). Who am I and what can I do? Children’s self-concepts
and motivation in achievement solutions. Educational Psychologist, 26, 233–261.
22. Draper, N. and Smith, H. (1981). Applied Regression Analysis, 2nd Edn. New York: John Wiley &
Sons, Inc.
23. Cooper, D., Arroyo, I., Woolf, B.P., Muldner, K., Burleson, W., and Christopherson, R. (2009).
Sensors model student self concept in the classroom. In International Conference on User Modeling,
Adaptation, and Personalization (UMAP 2009), Trento, Italy, June 22–26, 2009.
24. Dragon, T., Arroyo, I., Woolf, B.P., Burleson, W., El Kaliouby, R., and Eydgahi, H. (2008). Viewing
student affect and learning through classroom observation and physical sensors. In Proceedings
of the 9th International Conference on Intelligent Tutoring Systems, Montreal, Canada, June 23–27,
2008, pp. 29–39.
25. El Kaliouby, R. (2005). Mind-reading machines: Automated inference of complex mental states.
Unpublished Ph.D. thesis, University of Cambridge, Cambridge, U.K.
26. Royer, J.M. and Walles, R. (2007). Influences of gender, motivation and socioeconomic status on
mathematics performance. In D.B. Berch and M.M.M. Mazzocco (Eds.), Why is Math So Hard for
Some Children. Baltimore, MD: Paul H. Brookes Publishing Co., pp. 349–368.
27. Conati, C. and Maclaren, H. (2009). Empirically building and evaluating a probabilistic model
of user affect. User Modeling and User-Adapted Interaction, 19(3): 267–303.
28. Ekman, P. (1972). Universals and cultural differences in facial expressions of emotion. In J. Cole
(Ed.), Nebraska Symposium on Motivation 1971, vol. 19, pp. 207–283. Lincoln, NE: University of
Nebraska Press.
24
Capturing and Analyzing Student Behavior
in a Virtual Learning Environment: A Case
Study on Usage of Library Resources
Contents
24.1 Introduction......................................................................................................................... 339
24.2 Case Study: The UOC Digital Library.............................................................................340
24.3 Educational Data Analysis................................................................................................342
24.3.1 Data Acquisition.....................................................................................................343
24.3.2 UOC Users Session Data Set.................................................................................344
24.3.3 Descriptive Statistics..............................................................................................345
24.3.4 Categorization of Learners’ Sessions...................................................................346
24.4 Discussion............................................................................................................................ 349
24.4.1 Future Works........................................................................................................... 349
Acknowledgment......................................................................................................................... 349
References...................................................................................................................................... 349
24.1 Introduction
The Internet has completely changed what we understand as distance education.
Information and communication technologies have had a great impact on distance education
and, rapidly, e-learning has become a common way to access education. Every day, more
and more people use e-learning systems, environments, and contents for both training and
learning. Education institutions worldwide are taking advantage of the available technol-
ogy in order to provide education to a growing audience. Nowadays, most educational
institutions use learning management systems to provide learners with additional sup-
port in their learning process. According to Taylor [24], this is the fourth generation of
e-learning systems: spaces where there is an asynchronous process that allows students
and teachers to interact in an educational process expressly designed in accordance with
these principles. Indeed, these e-learning systems have evolved into what we know as vir-
tual learning environments, virtual spaces where users, services, and content converge,
creating learning scenarios where multiple interactions are generated.
One of the main elements in the learning process is the digital library, a wide service that
provides access to all the learning resources, ranging from books to specific documents
in digital format, and serves also as a gateway for accessing other resources outside the
institution. Following the recommendations from the new European Higher Education
Area (also known as the Bologna process), the learning process must become more
learner-centered
rather than content-centered. Therefore, the classical model of teachers
pushing content toward learners is obsolete. Instead, learners must acquire and develop
competences according to a learning process based on activities specially designed to do
so [1]. These activities involve the use of different content or, even better, force learners to
search and decide by themselves which content is the most suitable for them. This fact
promotes the use of services such as digital libraries or learning object repositories, which
are also becoming a common service in most virtual learning environments and are con-
sidered to be strategic for development purposes [9]. Nevertheless, the use of the digital
library is not really integrated into the learning process, as it is seen as an additional ser-
vice or resource. In fact, only a small percentage of learners use the digital library as
expected. Most students never access the digital library, or they do so only at the beginning
of the academic semester. The digital library can be accessed from several places in the
virtual learning environment, and this can explain the differences among learners with
respect to their learning goals: do they need a specific resource at a particular moment and
therefore access the digital library? Or do they browse the digital library as a basic
competence related to being online learners? These and other similar questions could be
answered by analyzing the real usage of the digital library. One of the most interesting
possibilities in any e-learning environment is tracking user navigational behavior for
analysis purposes, as it may help to discover unusual facts about the system itself, improve
understanding of users’ behavior, and also provide useful information for adaptation and
personalization purposes. In fact, the main goal of educational data mining is to feed such
discovered knowledge back into the learning management system, following an iterative
cycle of continuous improvement [21].
As stated in [19], it is possible to improve the instructional design of learning activities
if the behavior of learners in a particular context is analyzed. According to this premise,
we intend to use a session classification technique for identifying learning goals within
a single session, trying to find the relationship between digital library usage and the
learning scenario from which it is accessed.
This chapter is organized as follows: Section 24.2 describes the institutional environ-
ment that served as a case study for analyzing user behavior with respect to the digital
library access. In Section 24.3, we describe the proposed strategy for capturing user inter-
actions within the virtual learning environment and the further analysis on such usage
data related to the digital library. Finally, the most important conclusions of this work and
current and future research lines are summarized in Section 24.4.
on the virtual campus, constituting a true university community that uses the Internet
to create, structure, share, and disseminate knowledge. Within the UOC virtual campus,
each subject has a virtual classroom for the teaching and learning process. Virtual class-
rooms are the virtual meeting points for learning activities, following a student-centered
model [22]. The learning model determines the interaction produced in the virtual learn-
ing environment. The study of learners’ interaction can allow us to obtain new knowledge
to validate and improve the virtual learning environment design, and even the learning
model itself. Characterizing users’ navigation sessions may also contribute to improving
the virtual campus design.
The virtual classroom (see Figure 24.1) is the place where the students have access to all
the information related to their learning process. In the classroom, they find the tools to
interact and share resources with the instructor and the other students. The system noti-
fies students about the changes that have been produced from their last visit. The virtual
classroom is structured in four main areas: planning, communication, resources, and eval-
uation. The planning area contains the description of all the elements of the course, the
learning, and the calendar of activities and assignments. The communication area provides
the tools for students to communicate among themselves and with their instructor. This is
where the shared mailboxes or boards can be found. In this area, they also find the list of
the classroom participants and the information whether they are connected to the virtual
campus at that moment. The resource area contains the links to all the necessary digital
resources, as well as the listing of nondigital resources that are also a requirement of the
subject. Most of those links are, in fact, links to the university digital library. Finally, the
evaluation area contains the tools for assignment delivery as well as the direct accesses to
the applications related to the evaluation process, such as the qualifications view.
The digital library integrates all resources and digital contents of the virtual campus. It
allows users to look up and access digital resources and to request resources in nondigital
format. The digital library can be accessed from the main menu of the virtual campus and
also from the virtual classroom, although links can also be added from other learning
resources such as the discussion boards. To guide the study of the digital library
usage, three scenarios have been established. These scenarios characterize the way stu-
dents can access and interact with the library, according to the different navigation paths
available in the virtual campus, which are related to different possible user behaviors. The
first scenario is defined by users who log in to the virtual campus and navigate directly
to the digital library using the main menu button. It includes all those students who are
studying offline and need to access a resource or look for a book, so they connect to
FIGURE 24.1
Screenshots of the virtual campus: (a) shows an example of the campus home page, (b) shows a virtual class-
room overview, and (c) shows the interaction space of the classroom.
the virtual campus and access the digital library. The second scenario takes into account
accessing the digital library from the virtual classroom. In this scenario, students directly
access the resources specific to the subject via the links in the resource area of the virtual
classroom. This access is used very frequently since it simplifies the navigation and search
for specific resources in the digital library. The third scenario of use of the digital library
includes accessing the digital library from the discussion boards of the virtual classrooms.
The messages posted in the boards can contain a link related to the discussion topic. This
link could be a resource or a book in the digital library. In this work, we are interested in
understanding and explaining not the behavior of the students inside the digital library but
their navigation before accessing it. This information will be used to improve the learning
process as explained later. Basically, we will use clustering techniques in order to deter-
mine the relationship between the proposed usage scenarios and the observed naviga-
tional behavior.
Several authors [2,10,18] have proposed to analyze usage in order to provide a better
support for searching and browsing, personalized services, or improving user interface
design, among several possibilities. Digital libraries combine a web-based interface as a
front-end with a server acting as a back-end where all the searches and queries are exe-
cuted. Both the front-end and the back-end generate their own log files that are usually
analyzed separately, the former for studying how users interact with the user interface and
the latter for analyzing browsing and searching informational behaviors. In our case, we
are not interested in predicting which learners will access the digital library and when, but
in improving the understanding of the learners’ needs with respect to the digital library
as a basic service of the virtual classroom. The real usage of the digital library has been
always a critical issue for teachers and instructional designers. Preliminary experiments
[6] show that most learners do not use the digital library and other sources on a
continuous basis, but concentrate their accesses during the first days of the academic
semester. Surprisingly, the results from an internal institutional survey show that the digi-
tal library is one of the best-rated resources in the virtual campus, which is somewhat
contradictory to the real usage data. Therefore, it becomes necessary to analyze the
real digital library usage patterns in more detail and try to establish their relationship
with the other areas and elements of the virtual campus.
only navigational behavior but also the structure of the different services (we call them
regions) used during the learning process. Descriptive statistics are used to extract useful
information from such variables and Principal Component Analysis is used to establish
the most important relationships among the different regions visited by learners. Then,
learners are clustered according to their navigational behavior.
1. The initial home page (INIT), where the student reaches the university institu-
tional information, and has a summary of the activity of all the courses he or she
is enrolled into. This page can be accessed again from any point in the virtual
campus, so students use it as a “synchronization” point, when they change from
an activity (e.g., reading the mailbox) to another (e.g., going to the digital library).
2. The mail box (MAIL), where private messages are stored. The student can access
this service by four shortcuts situated on different parts of the virtual campus.
Although most students now have other mail addresses outside the virtual cam-
pus, it is still very common for them to use this one for all official communication
with other students, teachers, and so on.
3. The classroom (CLASS). In this space, the students have all the information about
each subject. In addition, the Teaching Plan, a document that describes the whole
learning process, can also be accessed from this region.
4. Virtual learning spaces (SPACES). These spaces are situated in each classroom,
and typically can be classified as forums, debate spaces, and news boards. In fact,
these spaces are special mailboxes where students can read (and sometimes write)
messages that are shared among all students in the same virtual classroom.
5. The digital library (LIBRARY), where the students can find the information they
need for complementing their learning process.
6. Other spaces (OTHERS). Under this term, we aggregate different administrative
resources, such as secretary services, community, research, news, agenda, files,
help, personalization aspects, and profile management, which are not mandatory
for learning purposes.
From the whole session data set, we filter out those where there is no interaction with the
digital library. The final analyzed data set has a total of 65,226 sessions, which is large
enough to perform a clustering analysis. These sessions are generated by 12,446 learners
(out of 29,531). From this data set, we compute the following information: the timestamp
of the session; the relative session number in the current day; the total number of relevant
clicks performed during the session; the initial click, that is, which one of the six regions
aforementioned is visited from the initial page; the session length in seconds; the day of
the week, in order to know whether students connect on weekends or holidays or not;
the hour segment when they started the session, in order to know if they connect in the
morning, afternoon, evenings, or at night; six values containing the probability of being
in each one of the six regions, namely P(Ri); and 36 values with the number of times that
the student goes from each region to each other region of the virtual campus, also normal-
ized, namely P(Ri → Rj). Notice that some of these values will be zero as the virtual learning
environment does not allow learners to navigate among all the regions.
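A minimal sketch of how such transition-probability features could be computed from a session's ordered sequence of region visits; the region names follow the six regions above, but the function and data layout are illustrative assumptions rather than the chapter's actual code.

from itertools import product

REGIONS = ["INIT", "MAIL", "CLASS", "SPACES", "LIBRARY", "OTHERS"]

def transition_features(session_regions):
    """Normalized counts P(Ri -> Rj) for one session's ordered region visits."""
    pairs = list(zip(session_regions, session_regions[1:]))
    counts = {(a, b): 0 for a, b in product(REGIONS, REGIONS)}
    for a, b in pairs:
        counts[(a, b)] += 1
    total = max(len(pairs), 1)
    return {f"P({a}->{b})": c / total for (a, b), c in counts.items()}

# Example session: login page, mailbox, back to home, classroom, resource area, library.
features = transition_features(["INIT", "MAIL", "INIT", "CLASS", "SPACES", "LIBRARY"])
# features["P(INIT->MAIL)"] == 0.2, etc.; 36 values per session.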
Notice that all these components use only variables related to navigation, that is, the
relative day of the academic semester, session length, and the other variables described
previously are not considered relevant by the PCA (for these components). Therefore, the
further clustering analysis will be performed only on the 36 variables describing the prob-
ability of accessing one of the six regions of the virtual campus from another. We will not
include the probability of being in a particular region as these are directly related to the
other 36 navigational variables.
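The analysis described here (PCA to inspect the relationships among regions, followed by clustering on the 36 transition variables) could be reproduced along the following lines with scikit-learn; the matrix X (sessions by 36 transition probabilities) and the choice of K = 3 follow the chapter, while the file name and the rest of the code are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# X: one row per session, 36 columns with the P(Ri -> Rj) transition probabilities.
X = np.load("session_transition_features.npy")   # hypothetical file name

# Inspect the principal components to see which transitions dominate.
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)

# Cluster the sessions; K = 3 as in the reported analysis.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
# The cluster centers (or the session closest to each center, a medoid-like
# prototype) can then be drawn as the navigational graphs of Figure 24.2.
print(np.bincount(labels))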
[Figure 24.2 graphic: directed graphs over the regions INIT, MAIL, CLASS, SPACES, LIBRARY, and OTHERS, with edges labeled by transition probabilities (e.g., one prototype consists essentially of a direct INIT → LIBRARY transition with probability 1.000).]
FIGURE 24.2
Directed graphs representing learners’ navigation described using K = 3 medoids.
1. The first clearly defined cluster illustrates a frequent kind of navigation on virtual
learning environments, the connections of students who are only interested in
checking a specific resource, usually responding to a premeditated behavior. This
behavior matches with one of the defined scenarios. Notice that the graph shows
that students check the virtual library, and choose to end the session just after
obtaining the desired answer. This fact cannot be considered an isolated case.
Notice that more than 19,000 navigational sessions behave in a similar way to the
graph corresponding to cluster 3.
2. The second graph shows a navigation that might be considered a classroom-driven
activity model. Students focus on the spaces found in the classroom, mainly con-
sulting the main novelties of the different spaces of the subjects where they are
enrolled. This cluster matches with the second scenario definition, where there
are students that access the digital library from the classroom, using the avail-
able links to the specific course resources.
3. In the third cluster, we aggregate the samples from sessions that are focused on
rich campus navigation. Students use the different resources available and even-
tually can check the virtual library to satisfy a punctual need. In fact, this cluster
is a mixture of the two main scenarios, students that directly access the digital
library and also students accessing it after visiting the discussion boards. If K = 4
clusters were used, this group would split into the two cases mentioned.
24.4 Discussion
In this chapter, we propose a simple algorithm to model the navigation on the virtual cam-
pus, taking into account the information from the logged sessions. The proposal is cen-
tered on an exploratory study of the campus activity related to the library resources. More
concretely, we propose an algorithm to extract an aligned feature set from the different
students’ sessions, which can be used as a previous step for any standard learning algo-
rithm. On the other hand, the representation provided allows a straightforward graphical
representation of the users’ behavior using navigational graphs. We applied the K-means
algorithm to cluster the session data, extracting a set of prototypes that help visualize
the users’ navigation in our exploratory study.
As every session is uniquely identified, it is possible to group sessions in recurrent visits
for each user. Understanding sessions is the first step toward the creation of a personaliza-
tion system [23] that takes into account not only what a user is doing in a single session but
also all his or her previous activity in the virtual learning environment. In fact, learners in
a virtual learning environment establish an ongoing relationship, that is, they maintain a
continuous activity during a long period of time (ranging from weeks to years), changing
their goals according to the context [16].
Acknowledgment
This work has been partially supported by the Spanish government project
PERSONAL(ONTO) under grant ref. TIN2006-15107-C02.
References
1. Ade, J. et al. (1999). The Bologna Declaration. Available at http://www.bologna-bergen2005.
no/Docs/00-Main doc/990719BOLOGNA DECLARATION.PDF
2. Bollen, J. and Luce, R. (2002). Evaluation of digital library impact and user communities by
analysis of usage patterns. D-Lib Magazine, 8(6).
3. Chen, C.C. and Chen, A.P. (2007). Using data mining technology to provide a recommendation
service in the digital library. The Electronic Library, 25(6), 711–724.
4. Cohen, J., Cohen, P., West, S.G., and Aiken, L.S. (2003). Applied Multiple Regression/Correlation
Analysis for the Behavioral Sciences, 3rd edn. Hillsdale, NJ: Lawrence Erlbaum Associates.
5. Cooley, R., Mobasher, B., and Srivastava, J. (1999). Data preparation for mining World Wide
Web browsing patterns. Knowledge and Information Systems, 1(1), 5–32.
6. Ferran, N., Casadesús, J., Krakowska, M., and Minguillón, J. (2007). Enriching e-learning meta-
data through digital library usage analysis. The Electronic Library, 25(2), 148–165.
7. Fok, A.W.P., Wong, H.S., and Chen, Y.S. (2005). Hidden Markov model based characterization
of content access patterns in an e-learning environment. In Proceedings of the IEEE International
Conference on Multimedia and Expo, Amsterdam, the Netherlands, July 6–9, 2005, pp. 201–204.
8. Fu, Y., Sandhu, K., and Shih, M.-Y. (2000). A generalization-based approach to clustering of
Web usage sessions. In Proceedings of the International Workshop on Web Usage Analysis and User
Profiling, San Diego, CA, Lecture Notes in Computer Science, Vol. 1836. Berlin, Germany: Springer,
pp. 21–38.
9. Joint, N. (2005). Strategic approaches to digital libraries and virtual learning environments
(VLEs). Library Review, 54(1), 5–9.
10. Jones, S., Cunningham, S.J., and McNab, R. (1998). Usage analysis of a digital library. In
Proceedings of the Third ACM Conference on Digital Libraries, Pittsburgh, PA, June 23–26, 1998, pp.
293–294.
11. Kaufman, J. and Rousseeuw, P.J. (1987). Clustering by means of medoids. In Statistical Data
Analysis Based on the L1 Norm, Y. Dodge (Ed.). Amsterdam, the Netherlands: North Holland/
Elsevier, pp. 405–416.
12. Kitsuregawa, M., Pramudiono, I., Ohura, Y., and Toyoda, M. (2002). Some experiences on large
scale Web mining. In Proceedings of the Second International Workshop on Databases in Networked
Information Systems, Aizu, Japan, December 16–18, 2002, Lecture Notes in Computer Science, Vol.
2544. Berlin, Germany: Springer, pp. 173–178.
13. Linde, Y., Buzo, A., and Gray, R. (1980). An algorithm for vector quantizer design. IEEE
Transactions on Communications, 28, 84–94.
14. Ming-Tso, M. and Mirkin, B. (2007). Experiments for the number of clusters in K-means. In
Proceedings of the Third Portuguese Conference on Artificial Intelligence, Guimarães, Portugal,
Lecture Notes in Computer Science, Vol. 4874. Berlin, Germany: Springer, pp. 395–405.
15. Mobasher, B., Dai, H.H., Luo, T., Sun, Y.Q., and Zhu, J. (2000). Integrating web usage and con-
tent mining for more effective personalization. In Proceedings of the First International Conference
on Electronic Commerce and Web Technologies, London, U.K., Lecture Notes in Computer Science,
Vol. 1875. Berlin, Germany: Springer, pp. 165–176.
16. Mor, E., Garreta-Domingo, M., Minguillón, J., and Lewis, S. (2007). A three-level approach
for analyzing user behavior in ongoing relationships. In Proceedings of the 12th International
Conference on Human-Computer Interaction, Beijing, China, July 22–27, 2007, pp. 971–980.
17. Mor, E., Minguillón, J., and Santanach, F. (2007). Capturing user behavior in e-learning envi-
ronments. In Proceedings of the Third International Conference on Web Information Systems and
Technologies, Barcelona, Spain, March 3–6, 2007, pp. 464–469.
18. Nicholson, S. (2005). Digital library archaeology: A conceptual framework for understanding
library use through artifact-based evaluation. The Library Quarterly, 75(4), 496–520.
19. Pahl, C. (2006). Data mining for the analysis of content interaction in web-based learning and
training systems. In Data Mining in E-Learning, Romero, C. and Ventura, S. (Eds.). Southampton,
U.K.: WIT Press, pp. 41–56, ISBN 1-84564-152-3.
20. Ray, S. and Turi, R. (1999). Determination of number of clusters in k-means clustering and
application in colour image segmentation. In Proceedings of the Fourth International Conference
on Advances in Pattern Recognition and Digital Techniques, Calcutta, India, December 27–29, 1999,
pp. 137–143.
21. Romero, C. and Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert
Systems with Applications, 33(1), 135–146.
22. Sangrà, A. (2002). A new learning model for the information and knowledge society: The case
of the UOC [online]. International Review of Research in Open and Distance Learning, 2(2). [Date
accessed: 18/08/2008].
23. Smeaton, A.F. and Callan, J. (2005). Personalisation and recommender systems in digital librar-
ies. International Journal on Digital Libraries, 5(4), 299–308.
24. Taylor, J.C. (1999). Distance education: The fifth generation. In Proceedings of the 19th ICDE
World Conference on Open Learning and Distance Education, Vienna, Austria, June 20–24, 1999.
25
Anticipating Students’ Failure As Soon As Possible
Cláudia Antunes
Contents
25.1 Introduction......................................................................................................................... 353
25.2 The Classification Problem................................................................................................ 354
25.2.1 Problem Statement and Evaluation Criteria.......................................................354
25.2.2 Anticipating Failure As Soon As Possible........................................................... 355
25.3 ASAP Classification............................................................................................................ 356
25.3.1 Problem Statement.................................................................................................. 356
25.3.2 CAR-Based ASAP Classifiers................................................................................ 357
25.4 Case Study........................................................................................................................... 359
25.5 Conclusions.......................................................................................................................... 362
References...................................................................................................................................... 362
25.1 Introduction
With the spread of information systems and the increased interest in education, the quan-
tity of existing data about students’ behaviors has exploded in the last decade. Those
datasets are usually composed of records about students’ interactions with several cur-
ricular units. On one hand, these interactions can be related to traditional courses (taught
at traditional schools) that reveal the success or failure of each student on each assess-
ment element of each unit that a student has attended. On the other hand, there are the
interactions with intelligent tutoring systems (ITS), where each record stores all students’
interactions with the system. In both cases, records for each student have been stored at
different instants of time, since both attendance of curricular units and corresponding
assessment elements, and ITS interactions, occur sequentially, in a specific order. Although
this order can be different for each student, the temporal nature of the educational process
is revealed in the same way: each student’s record corresponds to an ordered sequence of
actions and results.
Once there are large amounts of those records, one of their possible usages is the auto-
matic prediction of students’ success. Work on this area has been developed, with the
research being focused mainly on determining students’ models (see, e.g., [4,5]), and more
recently on mining frequent behaviors [2,12]. Exceptions to this general scenario are the
works [3,14,17] that try to identify failure causes. For predicting students’ success, exist-
ing data per se is not enough, but, combined with the right data mining tools, it can lead to
very interesting models of student behavior. Classification is the natural choice, since
it can produce such models based only on historical data (training datasets such as the ones
described above). Once these models are created, they can be used as predictive tools for
new students’ success.
Despite the excellent results of classification tools on prediction in several domains, the
educational process presents some particular issues, such as temporality, that bring addi-
tional challenges to this task.
In this chapter, we will describe how traditional classifiers can be adapted to deal
with these situations after they have been trained on full and rich datasets. We seize the
opportunity to succinctly explain the classification problem and describe the methodol-
ogy adopted to create ASAP classifiers (as soon as possible classifiers). The chapter will also
include a detailed study about the accuracy of these new classifiers when compared with
the traditional ones, both applied on full and reduced datasets.
In order to demonstrate that ASAP classifiers can anticipate students’ failure in an inter-
esting time window, the chapter will present a case study about students’ performance
recorded in the last 5 years in a subject in an undergraduate program.
The rest of the chapter is organized as follows: next, the classification problem is
described, giving particular attention to the most common measures of classifier efficacy; in
this section, the problem of ASAP classification is also introduced. In Section 25.3, a meth-
odology for the ASAP classification is proposed. Section 25.4 presents the case study for
evaluating the new methodology, concluding with a deep study on the impact of different
factors. The chapter closes with a critical analysis of the achieved results and points out
future directions to solve the problem.
where h(xi) is the estimation for xi’s class and c(xi) its own classification.
In order to distinguish the ability to classify instances from different classes, it is usual
to use the confusion matrix. This is a k × k matrix (with k the number of classes), where
each entry xij corresponds to the percentage of instances of class i that are classified as in
class j. Therefore, a diagonal matrix reveals an optimal classifier.
When there are only two classes, it is usual to talk about positive and negative instances:
instances that implement the defined concept and the ones that do not implement it,
respectively. In this case, it is usual to designate the confusion matrix diagonal entries as
true positives (TP) and true negatives (TN), and the others as false positives (FP) and false nega-
tives (FN), as depicted in Figure 25.2.
[FIGURE 25.2: Confusion matrix for binary classification, with diagonal entries TP and TN and off-diagonal entries FP and FN.]
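The accuracy formula that the sentence "where h(xi) is the estimation…" refers to appears to have been lost in extraction; a standard reconstruction, consistent with the surrounding definitions, is:

\[
\mathrm{accuracy}(h) \;=\; \frac{1}{n}\sum_{i=1}^{n} \delta\bigl(h(x_i),\, c(x_i)\bigr),
\qquad
\delta(a,b) \;=\; \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases}
\]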
In the binary case, it is useful to use some additional measures, namely sensitivity and
specificity. While the first one reflects the ability to correctly identify positive cases, the
second one reveals the ability to exclude negative ones. Hence, sensitivity is given by the
ratio between the number of instances correctly classified as positive (TP) and the num-
ber of real positive instances (TP + FN), Figure 25.3.
On the other hand, specificity is given by the ratio between the number of instances cor-
rectly classified as negative (TN) and the number of real negative instances (TN + FP), as
shown in Figure 25.4. In this manner, specificity is just sensitivity for the negative class.
FIGURE 25.3
Formula for sensitivity.
FIGURE 25.4
Formula for specificity.
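The formulas referenced by Figures 25.3 and 25.4 did not survive extraction; from the definitions in the text, they are:

\[
\mathrm{sensitivity} = \frac{TP}{TP + FN},
\qquad
\mathrm{specificity} = \frac{TN}{TN + FP}
\]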
knowledge. At the same time, students are asked to interact with the system, either almost
continuously or at specific instants of time; the process occurs during some time interval.
The results of these interactions are usually used to assess students’ success and progress.
Another important characteristic of the educational process is the fact that the success or
failure of the student is determined at a particular instant of time, usually at the end of
the process.
Consider, for example, the enrolment of a student in an undergraduate subject: he or
she begins to interact with the curricular unit, usually with some small assignments that
have a small weight on the final grade; as time goes on, the student is confronted with
harder tasks, and at the end of the semester he or she has to be evaluated on a final exam
and has to deliver a final project. His or her final mark is determined according to a mathemati-
cal formula that weights the different interactions. In this case, the best classifier is the one
that corresponds to that mathematical formula, which can be represented accurately by all
of the approaches referred to above. However, despite the perfect accuracy of this method, the goal is
not achieved: it is not able to predict in advance whether a student will fail.
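As a purely illustrative example of such a weighted grading formula (the weights below are hypothetical and not taken from the case study):

\[
\text{final mark} \;=\; \sum_{i} w_i\, g_i \quad\text{with}\quad \sum_i w_i = 1,
\qquad\text{e.g.}\quad 0.1\,g_{\text{assignments}} + 0.4\,g_{\text{project}} + 0.5\,g_{\text{exam}}.
\]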
To our knowledge, in order to accomplish the goal of predicting students’ results in
advance, the training of classifiers has to undergo some adaptations, namely weight-
ing the different attributes based on their instant of occurrence. In this manner, older
attributes have higher weights in the classification than newer ones. However, this
goes against the educational process, since the classifier would be almost the reverse of the
optimal one.
In the next section, a new approach is proposed that avoids this unnatural weighting.
Note, however, that nothing is said about the training process of classifiers. Indeed, the
use of a training set, composed of historical records, is compatible with the notion that
these historical instances are fully observable, which means that classifiers can be trained
using the traditional methods without needing any adaptation.
From this point forward, classifiers that work in the context of this formulation are
denominated as soon as possible classifiers (ASAP classifiers, in short).
These new classifiers can be trained using two different strategies: a first one, based on
the usage of the entire set of attributes, named the optimistic strategy, and a second one, the
pessimistic strategy, which trains the classifier using only the observable attributes.
The pessimistic strategy converts the problem into the traditional problem of classifi-
cation, by reducing each training instance from its original m-dimensional space to an
n-dimensional space, with n < m. Clearly, this strategy does not use all the information
available at classification time. Indeed, it wastes the historical values of the unobservable
attributes that exist in the training set. For this reason, it is expected that the pessimistic
strategy will lead to the creation of less accurate classifiers.
On the other hand, the optimistic strategy needs to train classifiers from m-dimensional
instances that can be applied to n-dimensional ones. Again, it is possible to consider two
approaches: either to convert the learnt classifier, a function from A^m to C, into a function
from A^n to C, or to convert the n-dimensional instances into m-dimensional ones.
Note that both approaches require some nontrivial transformation. In the case of the
second approach, it tries to enrich instances that are not fully observable, which can be
achieved with any method capable of estimating unobservable attributes from observable
ones. Next, an approach based on Class Association Rules (CAR) is described (a minimal
code sketch follows the listed steps).
• First, a classifier is trained on the entire training set D, for example, using a
decision tree learner.
• Second, for each unobservable attribute ai:
• It creates a subset of D, Di, with instances characterized by attributes a1 to an
followed by attribute ai; this last attribute assumes the role of class.
• It then finds CARi, the set of all class association rules in Di that satisfy the
chosen levels of confidence, support, and accuracy.
• Third, for each unobservable attribute zi in instance z:
• It identifies CARzi, the rules from CARi that match instance z.
• Among the consequents of the rules in CARzi, it chooses the best value for filling
in zi, say z′i.
• Finally, in order to predict the class label of instance z, it creates a new instance
z′ with m attributes, such that z′ = z1 z2 … zn z′n+1 … z′m, and submits it to the learnt
classifier.
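The first two (training-time) steps of this procedure can be sketched roughly as follows; the learner and the rule miner are left as user-supplied callables, since the chapter does not prescribe particular implementations, and the function and parameter names are ours. The classification-time steps are sketched after the matching criterion is formalized below.

```python
# Sketch of the training-time steps of the CAR-based ASAP approach:
# (1) learn a classifier on the entire training set D (all m attributes), and
# (2) for each unobservable attribute a_i, build D_i (the observable attributes
#     a_1..a_n followed by a_i playing the role of class) and mine its class
#     association rules, filtered by minimum support, confidence, and accuracy.
# train_classifier and mine_cars are user-supplied callables; names are illustrative.
import pandas as pd

def train_asap(D: pd.DataFrame, observable: list, unobservable: list,
               target: str, train_classifier, mine_cars):
    clf = train_classifier(D[observable + unobservable], D[target])
    cars = {}
    for ai in unobservable:
        Di = D[observable + [ai]]                # instances a_1..a_n plus a_i as class
        cars[ai] = mine_cars(Di, class_attr=ai)  # returns the rule set CAR_i
    return clf, cars
```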
Note that a similar methodology can be adopted even when other estimators are used to
predict the values of unobservable attributes. It is only necessary to define what "the
best value" means. In the case of the CAR-based ASAP classifier, the best value results
from combining the different matching rules and choosing the one that best matches the
instance. Combining the matching rules is necessary to identify the different possible
values and their corresponding interest, since several rules may match a single instance.
An instance x is said to match a class association rule r, denoted x ⊨ r, if and only if

∀i ∈ {1, …, n}: xi is missing ∨ xi = vi

A rule matches an instance better than another one if the first is more specific than the
second, which means that it has more attributes matching the instance and fewer
missing values. In this manner, more restrictive rules are preferred. Note that although the
variables xi with i ≤ n are observable, some instances may present missing values for those
variables, since their values may not have been recorded. For example, the rule
a1 = v1 ∧ a2 = v2 ∧ an = vn → Y = y is a better match than a1 = v1 ∧ a2 = v2 → Y = y, since it is
more specific (it has an additional constraint on the value of an).
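A minimal sketch of the classification step, putting together the matching criterion just defined and the selection of the best value, is given below; the rule representation and the classifier interface are simplified assumptions, not the authors' code.

```python
# Sketch of the prediction step of a CAR-based ASAP classifier. A rule matches
# an instance if every antecedent attribute is missing in the instance or has
# the rule's value; among matching rules the most specific one (largest
# antecedent, ties broken by confidence) supplies the estimated value.
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: dict      # {observable attribute: value}
    value: object         # consequent: estimated value of the unobservable attribute
    confidence: float

def matches(rule, instance):
    return all(instance.get(a) in (None, v) for a, v in rule.antecedent.items())

def best_value(rules, instance):
    candidates = [r for r in rules if matches(r, instance)]
    if not candidates:
        return None       # attribute stays missing; the classifier must cope with it
    best = max(candidates, key=lambda r: (len(r.antecedent), r.confidence))
    return best.value

def asap_predict(classifier, instance, cars):
    # cars maps each unobservable attribute to its rule set CAR_i (see above).
    extended = dict(instance)
    for attr, rules in cars.items():
        extended[attr] = best_value(rules, instance)
    return classifier(extended)   # classifier trained on all m attributes (assumed callable)
```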
The major disadvantage of the proposed methodology is that the existence of rules
matching a given instance is decisive for the success of the final classification. Indeed,
if it is not possible to estimate a value for each nonobservable attribute, then in most
cases the classifier cannot improve its accuracy.
One important advantage of the CAR-based ASAP classifier is the direct influence of the
minimum confidence and minimum support thresholds on the number of matching rules.
With lower levels of support, the number of matching rules increases exponentially
(a phenomenon well studied in the area of association analysis), while lower levels
of confidence decrease the certainty of the estimation, increasing the number of errors
made. In order to increase the accuracy of the estimation of unobservable attributes, a
rule is selected to estimate a value only if its accuracy is higher than a user-specified
threshold.
[Figure body not reproduced: two decision trees whose internal nodes test evaluation attributes such as Proj, Test, Exam1, Ex4, and Ex8 over the grades A, B, C, F, and NA, with leaves labeled App or Fail.]
FIGURE 25.5
Discovered decision trees for optimal (left) and pessimistic (right) classifiers.
Finally, the CAR-based ASAP classifier corresponds to the optimal one discovered and is
tested on the estimated testing set. This set is created by extending the small testing set
with estimations for unobservable attributes, creating a new full testing set.
Results confirm the expectations. The CAR-based ASAP classifier presents relative success
when compared with the pessimistic approach. Indeed, for lower levels of support
(0.5%), the accuracy achieved is always above that of the pessimistic classifier
(Figure 25.6, left).
It is important to note that for higher levels of support, the discovered rules do not cover
the majority of situations in the recorded students' results. This is explained by the
sparseness of the data, resulting from the large number of evaluation moments (variables):
there are only a few students with similar behaviors. In order to find them, it is necessary
to consider very low levels of support.
Additionally, note that decreasing the level of confidence decreases the quality of
the discovered rules: for fixed levels of support, the accuracy of the classifier
decreases as confidence decreases. This is because rules with lower confidence are
more generic and less predictive than those with higher confidence.
Another interesting result concerns the prediction of failure: the ASAP classifier
(Figure 25.6, right) always has better specificity levels than the pessimistic one. This
reflects the ability of our classifier to cover both positive and negative cases with the
same efficacy, since it looks for both failure and success rules.
[Plots not reproduced: two panels, “Accuracy vs. confidence and support” and “Sensibility vs. confidence and support,” with confidence (100%–50%) on the x-axis and curves for support levels 10%, 5%, 1%, and 0.5% plus the optimal and pessimistic classifiers.]
FIGURE 25.6
Accuracy, sensibility, and specificity for different levels of support and confidence (for a minimum CAR accu-
racy of 50%).
A look at the percentage of correct estimations of attribute values with CAR again shows
that accuracy increases as support decreases. However, the increase in correct values
is not directly correlated with the decrease in missing values. Indeed, when confidence
decreases, missing values also decrease but the percentage of correct values does not
(Figure 25.7).
Again, this is due to the different confidence levels of the discovered rules: if lower levels
of confidence are accepted, more rules are discovered, covering more instances, and they
are then used to estimate more unobservable values. However, since their confidence is
not high, the accuracy of these rules is not sufficient, and missing values are replaced by
wrong ones. With higher levels of confidence, the accuracy of the rules is higher and the
number of wrong values is reduced; that is, missing values prevail.
Note that the most important factor in the accuracy of the CAR-based ASAP classifier
is the accuracy of the discovered rules. Indeed, interesting results appear when the
minimum cutoff for CAR accuracy does not impair the estimation of values (Figure 25.8).
High levels of CAR accuracy exclude most of the discovered rules, which results in
assigning missing values to the unobservable attributes. It is important to note that C4.5
in particular deals better with missing values than with wrong estimations.
Another interesting fact is that the tree discovered with the pessimistic approach has
the same number of leaves as the optimal one (Figure 25.5). However, while the opti-
mal classifier tests the primary attributes (the ones for which a minimum threshold is required for a
[Plots not reproduced: nine panels showing the proportions of correct, missing, and wrong estimated values for attributes 13, 14, and 15, plotted against confidence (100%–50%) for support levels 10%, 5%, 1%, and 0.5%.]
FIGURE 25.7
Impact of support and confidence on the estimation of unobservable attributes (for a minimum CAR accuracy
of 50%).
[Plots not reproduced: two panels, “Accuracy vs. confidence and CAR accuracy,” at support 1% and 0.5%, with confidence (100%–50%) on the x-axis and curves for CAR accuracy thresholds 80%, 70%, 60%, and 50% plus the optimal and pessimistic classifiers.]
FIGURE 25.8
Impact of accuracy and confidence on the estimation of unobservable attributes.
student to have success, namely Project and Exams), the pessimistic one tests the attribute Test
and some of the weekly exercises. Small changes in the teaching strategy (e.g., changing the
order in which concepts are presented) would invalidate the pessimistic classifier.
25.5 Conclusions
In this chapter, we introduced a new formulation for predicting students' failure and
proposed a new methodology to implement it. Our proposal makes use of classifiers,
trained as usual using all available data, together with the estimation of unobservable
data (attributes) in order to predict failures as soon as possible. In our case, this estimation
is based on the discovery of class association rules.
Experimental results show that our methodology can overcome the difficulties of
approaches that do not deal with the entire set of historical data and that, by choosing the
best parameters (confidence, support, and rule accuracy), the results come close to those of
the optimal classifier found in the same data. However, choosing the best parameters is a
difficult task, compounded by the traditional problems related to the explosion in the
number of discovered rules. Other methodologies able to estimate the values of
unobservable variables (the EM algorithm [6] is just one such possibility) can also be
applied to build ASAP classifiers. A study of their application, advantages, and
disadvantages is needed in order to understand the full potential of ASAP classifiers.
References
1. Agrawal, R., T. Imielinsky, and A. Swami. 1993. Mining association rules between sets of items
in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data. ACM
Press, Washington, DC, pp. 207–216.
2. Antunes, C. 2008. Acquiring background knowledge for intelligent tutoring systems. In
Proceedings of the International Conference on Educational Data Mining. Montreal, Canada, June
20–21, 2008, pp. 18–27.
3. Antunes, C. 2009. Mining models for failing behaviors. In Proceedings of the Ninth International
Conference on Intelligent Systems Design and Applications. Pisa, Italy, November 30–December 02,
2009, IEEE Press.
4. Baker, R. and A. Carvalho. 2008. Labeling student behavior faster and more precisely with text
replays. In Proceedings of the First International Conference on Educational Data Mining. Montreal,
Canada, June 20–21, 2008, pp. 38–47.
5. Beck, J.E. 2007. Difficulties in inferring student knowledge from observations (and why should
we care). In Workshop Educational Data Mining—International Conference Artificial Intelligence in
Education. Marina del Rey, CA, 2007, pp. 21–30.
6. Dempster, A.P., N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1): 1–38. Blackwell Publishing.
7. Domingos, P. and G. Hulten. 2000. Mining high-speed data streams. In Proceedings of the Sixth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, MA,
ACM Press, New York, pp. 71–80.
8. Mitchell, T. 1982. Generalization as search. Artificial Intelligence, 18 (2): 223–236.
9. Nilsson, N. 1965. Learning Machines: Foundations of Trainable Pattern-Classifying Systems.
McGraw-Hill, New York.
10. Quinlan, J.R. 1993. C4.5 Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
11. Quinlan, J.R. 1986. Induction of decision trees. Machine Learning, 1(1): 81–106. Kluwer Academic
Publishers.
12. Romero, C., S. Ventura, P.G. Espejo, and C. Hervas. 2008. Data mining algorithms to classify
students. In Proceedings of the First International Conference on Educational Data Mining. Montreal,
Canada, June 20–21, 2008, pp. 8–17.
13. Scheffer, T. 2001. Finding association rules that trade support optimally against confidence. In
Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases
(PKDD’01). Springer-Verlag, Freiburg, Germany, pp. 424–435.
14. Superby, J.F., J.-P. Vandamme, and N. Meskens. 2006. Determination of factors influencing the
achievement of first-year university students using data mining methods. In Proceedings of the
Eighth Intelligent Tutoring System: Educational Data Mining Workshop (ITS’06). Jhongli, Taiwan,
pp. 37–44.
15. Vapnik, V. 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
16. Vapnik, V. and A. Chervonenkis. 1971. On the uniform convergence of relative frequencies of
events to their probabilities. Theory of Probability and its Applications, 16(2): 264–280. SIAM.
17. Vee, M.H., B. Meyer, and K.L. Mannock. 2006. Understanding novice errors and error paths in
object-oriented programming through log analysis. In Intelligent Tutoring System: Educational
Data Mining Workshop. Jhongli, Taiwan, June, 2006, pp. 13–20.
18. Witten, I. and E. Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations. Morgan Kaufmann, San Mateo, CA.
26
Using Decision Trees for Improving AEH Courses
Contents
26.1 Introduction......................................................................................................................... 365
26.2 Motivation............................................................................................................................ 365
26.3 State of the Art..................................................................................................................... 367
26.4 The Key-Node Method....................................................................................................... 368
26.5 Tools...................................................................................................................................... 369
26.5.1 Simulog..................................................................................................................... 369
26.5.2 Waikato Environment for Knowledge Analysis................................................ 369
26.5.3 Author Assistant Tool............................................................................................. 370
26.6 Applying the Key-Node Method...................................................................................... 370
26.6.1 Data Description..................................................................................................... 370
26.6.2 First Example........................................................................................................... 371
26.6.3 Second Example...................................................................................................... 372
26.7 Conclusions.......................................................................................................................... 374
Acknowledgment......................................................................................................................... 374
References...................................................................................................................................... 375
26.1 Introduction
Adaptive educational hypermedia systems (AEHS) seek to make the learning process
easier for each student by providing each one with (potentially) different educational
content, customized according to the student's needs and preferences. One of the main
concerns with AEHS is to test and decide whether adaptation strategies are beneficial for
all students or whether, on the contrary, some of them would benefit from different
decisions of the adaptation engine. Data-mining (DM) techniques can provide support to
deal with this issue; specifically, this chapter proposes the use of DM techniques for
detecting potential problems of adaptation in AEHS.
26.2 Motivation
Whenever possible, learning systems should consider individual differences among stu-
dents. Students can have different interests, goals, previous knowledge, cultural back-
ground or learning styles, among other personal features. If these features are taken
into account, it is possible to make the learning process easier or more efficient for each
individual student. In this sense, AEHS [1] provide a platform for delivering educational
material and activities through the web. They automatically guide students through the
learning process by recommending the most suitable learning activities at every moment,
according to their personal features and needs. AEHS have been successfully used in
different contexts, and many online educational systems have been developed (e.g., AHA! [2],
TANGOW [3], WHURLE [4], and more recently NavEx [5], QuizGuide [6], and CoMoLE [7],
among others).
Most of the AEHS proposed have been tested and evaluated against nonadaptive
counterparts, and many of them have shown important benefits for the students. Moreover,
systems supporting e-learning are commonplace nowadays. However, AEHS have not
been used in real educational environments as much as their potential and effectiveness
may suggest. The main obstacle to a wider adoption of AEH technology is the difficulty of
creating and testing adaptive courses. One of the main problems is that teachers need to
analyze how adaptation is working for different student profiles. In most AEHS, teachers
define rather small knowledge modules and rules to relate these modules, and the system
selects and organizes the material to be presented to every student depending on the
student profile. As a result of this dynamic organization of educational resources, the
teacher cannot easily look at the "big picture" of the course structure, since it can potentially
be different for each student and often also depends on the actions taken by the student at
runtime. In this sense, teachers would benefit from methods and tools specially designed to
support the development and evaluation of adaptive systems.
Due to their very nature, AEHS collect records of the actions performed by every student
while interacting with the adaptive course. These log files provide good opportunities for
applying web usage mining techniques with the goal of providing a better understanding
of student behavior and needs, as well as of how the adaptive course is fulfilling those needs.
With this intention, our effort is centered on helping authors to improve courses, and we
propose a spiral model for the life cycle of an adaptive course:
• The first step in this cycle is for the instructor (or educative content designer)
to develop a course with an authoring tool and to load it in a course delivering
system.
• The course is delivered to the students, collecting their interaction with the system
(log files).
• Afterward, the instructor can examine the log files with the aid of DM tools. These
tools help the instructor to detect possible failures or weak points of the course
and even propose suggestions for improving the course.
• The instructor can follow these suggestions and make the corresponding modifi-
cations to the course through the authoring tool and load the course in the deliver-
ing system again, so that new students can benefit from the changes.
Therefore, the instructor can improve the course on each cycle. However, using DM tools
to analyze the interaction data and interpreting the results, even if the tools are available,
can be a daunting task for nonexpert users. For this reason, methods that help the instruc-
tors and course designers to analyze the data need to be developed.
The proposed method (key-node method) consists of using decision trees to assist in
the development of AEH courses, particularly on the evaluation and improvement phase.
When analyzing the behavior of a number of students using an AEH system, the author
not only needs to find the "weak points" of the course, but also needs to consider how these
potential problems are related to the student profiles. For example, finding that 20% of
the students failed a given exercise is not the same as knowing that more than 80% of the
students with profile “English,” “novice” failed it. In this case, the goal of our approach is
not only to extract information about the percentage of students that failed the exercise but
also to describe the features that the students who failed have in common.
In order to show a practical use of the method, synthetic user data are analyzed. These
data are generated by Simulog [8], a tool able to simulate student behavior by generating
log files according to specified profiles. It is even possible to define certain problems of the
adaptation process that logs would reflect. In that way, it is possible to test this approach,
showing how the method will support teachers when dealing with student data.
of an AEH course [13]. This information is used to improve and personalize AEH courses
(recommending the next links). Further information can be found in a very complete survey
by Romero and Ventura [14]; it provides a good review of the main works (from 1995 to
2005) using DM techniques in e-learning environments, grouped by task, for both adaptive
and nonadaptive systems.
• Step 1: Cleaning phase. Select the records in which the type of activity is either
practical activity or test. It is important that all entries contain an indicator of
success or failure for each activity.
• Step 2: Apply the C4.5 algorithm with the following parameters:
• Attributes: the dimensions of the student model and the name of activity variable.
• Variable of classification: the indicator of success variable. This indicator shows
whether a student passes a given practical activity or test. Two values are possible
for this variable: yes or no. A value of yes indicates that the student's score is higher
than the minimum required (specified by the teacher); otherwise its value is no.
• Step 3: Evaluate the results. The resulting decision tree contains nodes for each
attribute. In other words, the tree can be composed of nodes related to the dimensions
of the student profile and one node related to the name of activity variable. The leaves
of the tree contain the values of the variable of classification, indicator of success.
The following steps are necessary to find the symptoms:
• Select the leaves in which the indicator of success variable has the value no. Only
these leaves are important, because they indicate that many students failed a given
activity.
• Analyze each path from the previously selected leaves to the root of the tree. For
each path, two steps are necessary:
– Find in the path the node with the name of activity and store it. The problems
in the adaptation should be closely related to this activity.
– Find in the path the values of the student profile (a sketch of the whole
procedure follows this list).
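A minimal sketch of the three steps above is given below, using scikit-learn's CART implementation as a stand-in for C4.5/J48 (which, unlike the ordinal encoding used here, handles categorical attributes natively); the column names and the cleaning filter follow the log attributes described later in the chapter and are illustrative assumptions.

```python
# Sketch of the key-node method with scikit-learn's CART tree standing in for
# C4.5/J48. Categorical attributes are ordinal-encoded, so split thresholds are
# reported on encoded values (decoding back to category names is omitted).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

PROFILE = ["language", "experience", "age"]          # student-model dimensions
FEATURES = PROFILE + ["activity"]

def key_node_symptoms(log: pd.DataFrame):
    # Step 1 (cleaning): keep only graded practical activities / tests.
    data = log[(log["action"] == "LEAVE-ATOMIC") & (log["activity_type"] == "P")]

    enc = OrdinalEncoder()
    X = enc.fit_transform(data[FEATURES])
    y = data["success"]                              # "yes" / "no"

    # Step 2: learn the decision tree (indicator of success as the class).
    tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)

    # Step 3: report every root-to-leaf path whose leaf predicts "no",
    # i.e., a possible symptom of bad adaptation (activity + profile values).
    t, classes = tree.tree_, tree.classes_
    def walk(node, path):
        if t.children_left[node] == -1:              # leaf node
            if classes[t.value[node][0].argmax()] == "no":
                yield path
            return
        name = FEATURES[t.feature[node]]
        thr = t.threshold[node]
        yield from walk(t.children_left[node],  path + [f"{name} <= {thr:.1f}"])
        yield from walk(t.children_right[node], path + [f"{name} > {thr:.1f}"])
    return list(walk(0, []))
```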
26.5 Tools
This section provides a short description of the tools utilized to apply the key-node
method. In addition, a description of ASquare (Author Assistant Tool) is provided.
26.5.1 Simulog
Simulog is a tool that simulates log files for several student profiles, containing symptoms
of bad adaptation [8]. A symptom of bad adaptation is, for example, that most of the
students with profile “experience = novice” fail a given practical activity. The first step in
Simulog is to load the course description. Simulog is able to extract the parameters of the
adaptation from the course description. These parameters generally include the types of
student profiles. Afterward, the types of student profiles and their percentages, the number
of students to be generated, the average time that a student spends on an activity, and the
symptom of bad adaptation can be specified. Simulog reads the course description and,
based on a randomly generated student profile, reproduces the steps that a student with
this profile would take in the adaptive course.
Table 26.1
Dimensions for the Student Profile
Dimension Values
Language Spanish (35%), English (32.5%), German (32.5%)
Experience Novice (50%), Advanced (50%)
Age Young (50%), Old (50%)
Table 26.2
Description of the Attributes of an Entry of Log File in TANGOW
Attributes Description
Activity Activity id
Complete It represents the level of completeness of the activity. If the activity is composed, it
takes into consideration the completeness of all subactivities. It is a numeric
parameter that ranges from 0 to 1. Value 0 indicates the activity was not completed
and value 1 indicates the activity was fully completed.
Grade The grade given to each activity. In practical (P) activities it is usually calculated from
a formula provided by the teacher. In composed activities it is the arithmetic mean of
subactivity grades. Value 1 indicates the activity was finished with success and value
0 indicates the activity was finished with failures.
NumVisit Number of times the student has visited the activity.
Action The action executed by the student; these are defined by the TANGOW system:
“START-SESSION”: beginning of the learning session.
“FIRSTVISIT”: first time an activity is visited.
“REVISIT”: any visit to the activity following the first one.
“LEAVE-COMPOSITE”: the student leaves the (composed) activity.
“LEAVE-ATOMIC”: the student leaves the (atomic) activity.
ActivityType The type of activity: theoretical (T), exercises (P), and examples (E).
SyntheticTime Time stamp generated by Simulog when the student starts interacting with the activity.
Success Indicates whether the activity is considered successful (yes) or not (no).
These two entries show that the student s100 with profile {“Spanish”; “novice”; “young”}
visited the “exercises of signs of traffic policeman” activity (S_Ag_Exer). It has 0.0 for
complete, 0.0 for grade, and this is the first visit to this activity. The type “P” means that this
is a practical activity. The second entry shows that the student left this activity without
completing it (complete = 0.0) and with an insufficient score to pass the exercise; for this
reason, success is set to no.
This symptom of bad adaptation represents that 70% of students with “language = Spanish,”
“experience = novice,” and “age = young” fail the S_Ag_Exer activity. That is to say,
students with profile {“Spanish”; “novice”; “young”} fail the activity “exercises of signs of
traffic policeman” with 70% probability.
According to the previous method, the first step (cleaning phase) was to clean the data.
It consists of removing from the log file the records that are not necessary for the mining
[Tree diagram not reproduced: the tree splits on activity and on the profile dimensions language (Spanish, English, German), experience (novice, advanced), and age (young, old).]
Figure 26.1
Decision tree for the first example.
phase. Cleaning in this case is both important and necessary to reduce the overall size of
the data and, consequently, to improve the speed and accuracy with which results are
obtained. Only practical activities are considered in the analysis, since these activities
provide a reliable grade. The practical activities are denoted in the log file by action equal
to “LEAVE-ATOMIC” and type equal to “P”. As our intention is to analyze the practical
activities, the records with action different from “LEAVE-ATOMIC” or with type of activity
different from “P” were eliminated. As a result, the final set contained 960 records. The
second step (apply the C4.5 algorithm) is to generate the decision tree. Figure 26.1 shows
the obtained decision tree. This tree is composed of three nodes related to dimensions of
the student profile, namely language, experience, and age, and one node related to the
name of activity, the node activity. The last step of the method (evaluate the results) is to
find the node activity and the profile, as follows:
• In the tree, only one leaf is found with the value no. This leaf has 77% of well-
classified instances. The value of the node activity for this leaf is S_Ag_Exer, and
the student profile is formed by “age = young,” “experience = novice,” and
“language = Spanish”.
Therefore, this tree indicates that a great number of the students who follow the Spanish
version of the course, who have novice experience, and who are young had many failures
in the S_Ag_Exer activity. It is important to highlight that in this example the tree has a
high percentage of well-classified instances. This is due to the absence of a randomness
effect in the grade variable when a student is not related to the symptom of bad adaptation:
in this case, these students always pass the activity.
[Tree diagram not reproduced: the root splits on age, with further splits on experience and other attributes; most visible leaves are labeled yes, and the two no leaves are discussed in the text below.]
Figure 26.2
Decision tree for the second example.
Therefore, in this example, there are two sources of noise: the number of symptoms and
the randomness effect. The symptoms were defined as 60% of students with profile
{“Spanish”; “novice”; “young”} failing the S_Ag_Exer activity (exercises of signs of traffic
policeman), and 60% of students with profile {“English”; “novice”; “young”} failing the
S_Circ_Exer activity (exercises of circular signs). The first phase of the key-node method is
to clean the data (cleaning phase), as in the first example. The cleaning phase yielded 1920
records, to which the decision tree algorithm was applied in the second step (Figure 26.2
shows the decision tree). The last step of the method obtained the following outcomes:
• Two leaves with the value no are found in the tree. Two activities are related to
these leaves: S_Ag_Exer and S_Circ_Exer. Therefore, two possible symptoms of
bad adaptation can be found.
• For the first no leaf (related to the node “activity = S_Ag_Exer”), the student profile
is defined by the variables “experience = novice,” “language = Spanish,” and
“age = young.”
• For the second no leaf (related to the node “activity = S_Circ_Exer”), the student
profile is defined by the variables “experience = novice,” “language = English,”
and “age = young”.
Thus, two symptoms of bad adaptation are detected, since the proportion of well-classified
instances is reasonably high in both no leaves (more than 70%). Hence, the young students
with novice experience who follow the Spanish version of the course had many difficulties
with the S_Ag_Exer activity. In addition, there was another group of young students with
novice experience who had many difficulties with the S_Circ_Exer activity, but the
language in this group was English.
26.7 Conclusions
This work proposes a practical way, based on decision trees, to search for possibly wrong
adaptation decisions in AEHS. The decision tree technique is a useful method for detecting
patterns related to symptoms of potential problems in the adaptation procedure.
This chapter presents two experiments intended to show the advantages of this method.
They were carried out with different numbers of simulated students and also with different
percentages of students failing the same exercise, all of them corresponding to a certain
profile. The first experiment demonstrates the effectiveness of decision trees for detecting
existing symptoms of bad adaptation without noise in the data. The second experiment
was carried out with a larger number of students. Moreover, noise was included in the data
through a randomness factor in the grade variable, added with the objective of generating
data closer to reality. This experiment shows the scalability and reliability of the algorithm.
Furthermore, the method for detecting symptoms provides instructors with two types of
information. On the one hand, the instructor can learn whether a symptom is closely related
to a given activity, and can then decide to check the activity and the adaptation around it.
On the other hand, the instructor can detect whether a group of students belonging to a
certain user profile (or sharing certain features) had trouble with an activity. Then, the
instructor can decide to modify the activity itself, to include additional activities to
reinforce the corresponding learning, to establish prerequisites for tackling the activity, or
to change the course structure for students matching this profile by incorporating rules
that represent the corresponding adaptation for this type of student.
The usefulness of this method for detecting potential problems in AEH courses has been
shown in the two examples. However, to be useful for instructors, this method ought to
be supported by tools that hide the technical details from users who are not experts in DM.
In that sense, we are improving our evaluation tool, ASquare (Author Assistant), by adding
the key-node method [18].
It is important to highlight that the utility of decision trees shown in this chapter is not
centered on the accuracy of predicting the success of students when tackling learning
activities. Thus, the percentage of well-classified events is less important than the capability
of the tree to show the symptoms of bad adaptation. Finally, the two examples showed
that, although decision trees are a powerful technique, they also have weak points. An
important weakness is that the information extracted may not always be complete, since
the C4.5 algorithm works with probabilities of events. Therefore, to complement the
information extracted, it may be necessary to use this method together with other DM
techniques such as association rules, clustering, or other multivariate statistical techniques.
In that sense, our future work is centered on testing the combination of decision trees with
other techniques to complete the information extracted from them; in other work we have
already analyzed whether association rules provide additional information to decision
trees [19]. Another important challenge is to determine the failure-rate threshold that
indicates a symptom of bad adaptation. This last challenge is also part of our future work.
Acknowledgment
The work presented in this chapter has been partially funded by the Spanish Ministry of
Science and Education through project HADA (TIN2007-64716).
References
1. Brusilovsky, P. 2003. Developing adaptive educational hypermedia systems: From design
models to authoring tools. In Authoring Tools for Advanced Technology Learning Environment, ed.
T. Murray, S. Blessing, and S. Ainsworth, pp. 377–409. Dordrecht, the Netherlands: Kluwer
Academic Publishers.
2. De Bra, P., Aerts, A., Berden, B. et al. 2003. AHA! The Adaptive Hypermedia Architecture. In
Proceedings of 14th ACM Conference on Hypertext and Hypermedia, pp. 81–84. Nottingham, U.K.:
ACM Press.
3. Carro, R.M., Pulido, E., and Rodríguez, P. 1999. Dynamic generation of adaptive internet-based
courses. Journal of Network and Computer Applications 22:249–257.
4. Moore, A., Brailsford, T.J., and Stewart, C.D. 2001. Personally tailored teaching in WHURLE
using conditional transclusion. In Proceedings of the 12th ACM Conference on Hypertext and
Hypermedia, pp. 163–164. Odense, Denmark: ACM Press.
5. Yudelson, M. and Brusilovsky, P. 2005. NavEx: Providing navigation support for adaptive
browsing of annotated code examples. In Proceedings of 12th International Conference on Artificial
Intelligence in Education (AIED), ed. C.K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, pp.
710–717. Amsterdam, the Netherlands: IOS Press.
6. Sosnovsky, S. and Brusilovsky, P. 2005. Layered evaluation of topic-based adaptation to stu-
dent knowledge. In Proceedings of 4th Workshop on the Evaluation of Adaptive Systems at 10th
International User Modeling Conference, ed. S. Weibelzahl, A. Paramythis, and J. Masthoff, pp.
47–56, Edinburgh, U.K.
7. Martín, E., Carro, R.M., and Rodríguez, P. 2006. A mechanism to support context-based adap-
tation in m-learning. Innovative Approaches for Learning and Knowledge Sharing, Lecture Notes in
Computer Science 4227:302–315.
8. Bravo, J. and Ortigosa, A. 2006. Validating the evaluation of adaptive systems by user profile
simulation. In Proceedings of Fifth Workshop on User-Centred Design and Evaluation of Adaptive
Systems held at the Fourth International Conference on Adaptive Hypermedia and Adaptive Web-Based
Systems (AH2006), ed. S. Weibelzahl and A. Cristea, pp. 479–483, Dublin, Ireland.
9. Becker, K., Marquardt, C.G., and Ruiz, D.D. 2004. A pre-processing tool for web usage min-
ing in the distance education domain. In Proceedings of the International Database Engineering
and Application Symposium (IDEAS’04), ed. J. Bernardino and B.C. Desai, pp. 78–87. Coimbra,
Portugal: IEEE.
10. Merceron, A. and Yacef, K. 2005. Educational data mining: A case study. In Proceedings of the
12th International Conference on Artificial Intelligence in Education (AIED), ed. C. Looi, G. McCalla,
B. Bredeweg, and J. Breuker, pp. 467–474. Amsterdam, the Netherlands: IOS Press.
11. Merceron, A. and Yacef, K. 2007. Revisiting interestingness of strong symmetric association
rules in educational data. In Proceedings of International Workshop on Applying Data Mining
in E-Learning (ADML07) held at the 2nd European Conference on Technology Enhanced Learning
(EC-TEL 2007), pp. 3–12, Crete, Greece.
12. Cheong, M-H., Meyer, B., and Mannock, K.L. 2006. Understanding novice errors and error paths
in object-oriented programming through log analysis. In Proceedings of Workshop on Educational
Data Mining at the 8th International Conference on Intelligent Tutoring Systems (ITS 2006), ed. C.
Heiner, R. Baker, and K. Yacef, pp. 13–20, Jhongli, Taiwan.
13. Romero, C., Porras, A.R., Ventura, S., Hervás, C., and Zafra, A. 2006. Using sequential pat-
tern mining for links recommendation in adaptive hypermedia educational systems. Current
Developments in Technology Assisted Education 2:1016–1020.
14. Romero, C. and Ventura, S. 2007. Educational data mining: A survey from 1995 to 2005. Expert
Systems with Applications 33(1):135–146.
15. Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann
Publishers.
16. Mitchell, T. 1997. Decision Tree Learning. New York: McGraw Hill.
17. Witten, I.H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. San
Francisco, CA: Morgan Kaufmann Publishers.
18. Bravo, J., Vialardi, C., and Ortigosa, A. 2008. ASquare: A powerful evaluation tool for adap-
tive hypermedia course system. In Proceedings of Hypertext Conference, ed. P. Brusilovsky,
pp. 219–220. Pittsburgh, PA: Sheridan Printing.
19. Vialardi, C., Bravo, J., and Ortigosa, A. 2009. Improving AEH courses through log analysis.
Journal of Universal Computer Science (J.UCS) 14(17):2777–2798.
27
Validation Issues in Educational Data Mining:
The Case of HTML-Tutor and iHelp
Contents
27.1 Introduction......................................................................................................................... 377
27.2 Validation in the Context of EDM.................................................................................... 378
27.3 Disengagement Detection Validation: A Case Study.................................................... 378
27.3.1 Detection of Motivational Aspects in e-Learning.............................................. 378
27.3.2 Proposed Approach to Disengagement Detection............................................. 379
27.3.3 Disengagement Detection Validation.................................................................. 379
27.3.3.1 Data Considerations................................................................................ 379
27.3.3.2 Annotation of the Level of Engagement............................................... 380
27.3.3.3 Analysis and Results............................................................................... 381
27.3.3.4 Cross-System Results Comparison........................................................ 384
27.4 Challenges and Lessons Learned..................................................................................... 385
27.5 Conclusions.......................................................................................................................... 386
References...................................................................................................................................... 386
27.1 Introduction
Validation is one of the key aspects in data mining and even more so in educational data
mining (EDM) owing to the nature of the data. In this chapter, a brief overview of vali-
dation in the context of EDM is given and a case study is presented. The field of the
case study is related to motivational issues, in general, and disengagement detection,
in particular. There are several approaches to eliciting motivational knowledge from a
learner’s activity trace; in this chapter the validation of such an approach is presented
and discussed.
The chapter is structured as follows. Section 27.2 provides an overview of validation
in the context of EDM. Section 27.3 presents the case study, including previous work on
motivation in e-Learning, details of data and methods, and results. Section 27.4 presents
some challenges encountered and lessons learned and, finally, Section 27.5 concludes the
chapter.
Third, engagement tracing [6] is an approach based on Item Response Theory that models
disengagement by estimating the probability of a correct response given a specific response
time; two ways of generating responses are assumed: (1) a "blind guess" when the student
is disengaged, and (2) an answer with a certain probability of being correct when the
student is engaged. The model also takes into account individual differences in reading
speed and level of knowledge.
Fourth, a dynamic mixture model combining a hidden Markov model with Item
Response Theory was proposed in [13]. The dynamic mixture model takes into account
student proficiency, motivation, evidence of motivation, and the student’s response to a
problem. The motivation variable can have three values: (1) motivated; (2) unmotivated
and exhausting all the hints in order to reach the final one that gives the correct answer,
categorized as unmotivated-hint; and (3) unmotivated and quickly guessing answers to
find the correct answer, categorized as unmotivated-guess.
Fifth, a Bayesian network has been developed [2] from log-data in order to infer variables
related to learning and attitudes toward the tutor and the system. The log-data registered
variables such as problem-solving time, mistakes, and help requests.
Last, a latent response model [4] was proposed for identifying students who game the
system. Using a pretest–posttest approach, gaming behavior was classified into two
categories: (1) with no impact on learning and (2) with a decrease in learning gain. The
variables used in the model were the student's actions and probabilistic information about
the student's prior skills. The same problem of gaming behavior was addressed in [23], an
approach that combines classroom observations with logged actions in order to detect
gaming behavior manifested by guessing and checking or hint/help abuse.
Table 27.1
The Attributes Used for Analysis
Codes Attributes
NoPages Number of pages read
AvgTimeP Average time spent reading
NoQuestions Number of questions from quizzes/surveys
AvgTimeQ Average time spent on quizzes/surveys
Total time Total time of a sequence
NoPpP Number of pages above the threshold established
for maximum time required to read a page
NoPM Number of pages below the threshold established
for minimum time to read a page
applications, the iHelp Discussion System and iHelp Learning Content Management
System, designed to support both learners and instructors throughout the learning pro-
cess. The latter is designed to deliver online courses to students working at a distance, pro-
viding course content (text and multimedia) as well as quizzes and surveys. The students’
interactions with the system are preserved in a machine readable format.
The same type of interaction data was selected from the logged information in order to
perform the same type of analysis as the one performed on the HTML-Tutor data. An
HTML course was also chosen, to prevent differences in results caused by differences in
subject matter. Data from 11 students were used, comprising a total of 108 sessions and 450
sequences (341 of exactly 10 min and 109 of less than 10 min). While at first glance a sample
size of 11 students may seem rather small, it should be noted that the total time observed
(i.e., more than 60 h of learning) as well as the number of instances analyzed (i.e., 450
sequences) is far more important for the validity of the results.
Several attributes (displayed in Table 27.1) related to reading pages and taking quizzes
were used in the analysis. The terms tests and quizzes are used interchangeably; they refer
to the same type of problem-solving activity, except that in HTML-Tutor they are called
tests and in iHelp they are named quizzes. Total time (of a sequence) was included as an
attribute for the trials that took into account sequences of less than 10 min as well as
sequences of exactly 10 min. Compared to the analysis of the HTML-Tutor logs, for iHelp
there are fewer attributes related to quizzes: information about the number of questions
attempted and the time spent on them is included, but information about the correctness
or incorrectness of the answers given by users was not available at the time of the analysis.
Two new meta-attributes that were not considered for HTML-Tutor were introduced for
this analysis: the number of pages above and below a certain time threshold, described in
the subsequent section; they are meta-attributes because they are not among the raw data
but are derived from it.
These rules were applied after the expert annotations had been obtained, as a result of
a common pattern observed for both HTML-Tutor and iHelp. Consequently, the two new
meta-attributes were added to investigate their contribution to prediction and their
potential use in a less time-consuming annotation process.
Initially, we intended to use the average time spent on each page across all users, as
suggested by [19], but on analyzing the data we saw that some pages are accessed by a very
small number of users, sometimes only one; this problem was also encountered in other
research (e.g., [10]). Consequently, we decided to use the average reading speed, known
to be between 200 and 250 words per minute [20,21]. Of the 652 pages accessed by the
students, five pages needed between 300 and 400 s to be read at average speed, 41 pages
needed between 200 and 300 s, 145 needed between 100 and 300 s, and 291 needed less
than 100 s. Some pages included images and videos; however, only two students attempted
to watch videos, one giving up after 3.47 s and the other watching a video (or being on the
page with the link to a video) for 162 s (almost 3 min). Taking this information into account,
less than 5 s or more than 420 s (7 min) spent on a page was agreed to indicate
disengagement.
For the HTML-Tutor logs, the level of engagement was established by human experts
who looked at the log files and determined the level of engagement for each sequence (of
10 min or less), in a manner similar to the analysis described by [9]. The same procedure
was applied for iHelp, plus the two aforementioned rules.
Accordingly, the level of engagement was determined for each sequence of 10 min or
less. If in a sequence the learner spent more than 7 min on a page or test, the learner was
considered disengaged during that sequence. In relation to pages accessed for less than
5 s, a user was considered disengaged if two-thirds of the total number of pages were below
that time.
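The two annotation rules can be expressed compactly as below; the input is the list of per-page (or per-quiz) viewing times, in seconds, within one 10-minute sequence, and the thresholds are the 5 s and 420 s values agreed above.

```python
# Rule-based labeling of one sequence: disengaged if any single page/test took
# more than 420 s (7 min), or if at least two-thirds of the pages were viewed
# for less than 5 s. The "engaged" default for an empty sequence is our choice.
def label_sequence(times_seconds):
    if not times_seconds:
        return "engaged"
    too_long = any(t > 420 for t in times_seconds)
    mostly_skimmed = sum(t < 5 for t in times_seconds) >= (2 / 3) * len(times_seconds)
    return "disengaged" if too_long or mostly_skimmed else "engaged"

# e.g. label_sequence([12, 3, 2, 1, 4, 2])  -> "disengaged" (five of six pages under 5 s)
#      label_sequence([95, 130, 60, 210])   -> "engaged"
```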
With HTML-Tutor, rating consistency was verified by measuring inter-coder reliability.
A sample of 100 sequences (from a total of 1015) was given to a second rater, and the
results indicated high inter-coder reliability: a percentage agreement of 92%, a Cohen's
kappa of 0.826 (p < 0.01), and a Krippendorff's alpha of 0.845 [15]. With iHelp, only one
rater classified the level of engagement for all sequences.
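For reference, the two agreement measures reported for HTML-Tutor can be computed as follows (Krippendorff's alpha needs a separate package and is omitted); the label vectors here are invented toy data, not the study's annotations.

```python
# Percentage agreement and Cohen's kappa over a doubly coded sample.
# The two label vectors below are toy examples ("e" = engaged, "d" = disengaged).
from sklearn.metrics import cohen_kappa_score

rater1 = ["e", "e", "d", "e", "d", "d", "e", "e", "d", "e"]
rater2 = ["e", "e", "d", "e", "d", "e", "e", "e", "d", "e"]

agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
kappa = cohen_kappa_score(rater1, rater2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```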
Table 27.2
Datasets Used in the Experiment
Dataset    Sequences                Attributes
Dataset 1  All sequences            NoPages, AvgTimeP, NoQuestions, AvgTimeQ, Total time, NoPpP, NoPM
Dataset 2  All sequences            NoPages, AvgTimeP, NoQuestions, AvgTimeQ, Total time
Dataset 3  Only 10 min sequences    NoPages, AvgTimeP, NoQuestions, AvgTimeQ, Total time, NoPpP, NoPM
Dataset 4  Only 10 min sequences    NoPages, AvgTimeP, NoQuestions, AvgTimeQ, Total time
(8) Decision Trees (DT) with the J48 classifier based on Quinlan's C4.5 algorithm. The
experiments were done using 10-fold stratified cross-validation iterated 10 times.
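The evaluation protocol can be reproduced along the following lines; scikit-learn models are used here as stand-ins for the listed learners (e.g., a CART tree in place of J48, k-nearest neighbours in place of IBk), so the exact figures will differ from those reported.

```python
# 10-fold stratified cross-validation repeated 10 times, for a few classifiers.
# These scikit-learn models approximate, but are not identical to, the learners
# used in the study; X is the attribute matrix and y the engaged/disengaged labels.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    models = {
        "LR": LogisticRegression(max_iter=1000),
        "IBk (k-NN)": KNeighborsClassifier(),
        "DT (CART in place of J48)": DecisionTreeClassifier(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```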
Results are displayed in Table 27.3, including accuracy and its standard deviation across
all trials, the true positive (TP) rate for the disengaged class, the precision (TP/(TP + false
positives)) for the disengaged class, the mean absolute error, and the kappa statistic. In our
case, the TP rate is more important than precision because the TP rate indicates the correct
percentage of actual instances of a class, whereas precision indicates the correct percentage
of predicted instances
Table 27.3
Experiment Results Summary
Dataset Measure BN LR SL IBk ASC B CvR DT
Dataset 1 Accuracy 89.31 95.22 95.13 95.29 95.44 95.22 95.44 95.31
Std. Dev 4.93 2.78 2.82 2.98 2.97 3.12 3.00 3.03
TP rate 0.90 0.95 0.95 0.94 0.94 0.94 0.95 0.95
Precision 0.90 0.95 0.95 0.96 0.97 0.97 0.96 0.96
Error 0.13 0.07 0.10 0.05 0.08 0.08 0.08 0.07
Kappa 0.79 0.90 0.90 0.91 0.91 0.90 0.91 0.91
Dataset 2 Accuracy 81.73 83.82 83.58 84.00 84.38 85.11 85.33 84.38
Std. Dev 5.66 5.03 5.12 4.85 5.08 5.17 5.13 5.07
TP rate 0.78 0.82 0.81 0.79 0.77 0.79 0.80 0.78
Precision 0.86 0.86 0.86 0.89 0.91 0.91 0.91 0.91
Error 0.22 0.24 0.26 0.20 0.25 0.23 0.23 0.25
Kappa 0.64 0.68 0.67 0.68 0.69 0.70 0.71 0.69
Dataset 3 Accuracy 94.65 98.06 97.91 98.59 97.65 97.65 97.76 97.47
Std. Dev 4.47 2.18 2.69 2.11 2.64 2.64 2.65 2.58
TP rate 0.95 0.97 0.96 0.98 0.96 0.96 0.96 0.96
Precision 0.94 0.99 0.99 0.99 0.99 0.99 0.99 0.99
Error 0.07 0.02 0.04 0.02 0.05 0.04 0.03 0.03
Kappa 0.89 0.96 0.96 0.97 0.95 0.95 0.95 0.95
Dataset 4 Accuracy 84.29 85.82 85.47 84.91 84.97 85.38 85.26 85.24
Std. Dev. 5.77 5.90 5.88 5.95 5.61 5.80 5.96 5.91
TP rate 0.78 0.77 0.76 0.77 0.75 0.76 0.75 0.75
Precision 0.88 0.92 0.92 0.89 0.92 0.92 0.92 0.92
Error 0.18 0.22 0.23 0.20 0.25 0.23 0.24 0.24
Kappa 0.68 0.71 0.70 0.69 0.69 0.70 0.70 0.70
Table 27.4
The Confusion Matrix for Instance Based
Classification with IBk Algorithm
Predicted
Engaged Disengaged
Actual Engaged 180 1
Disengaged 4 155
in that class. In other words, the aim is to identify as many disengaged students as possible.
If an engaged student is misdiagnosed as being disengaged and receives special treatment
for remotivation, this causes less harm than the opposite situation.
The results presented in Table 27.3 show very good levels of prediction for all methods,
with accuracy varying between approximately 81% and 98%. The results for the disengaged
class are similar, with the TP rate and the precision indicator varying between 75% and
98%. The mean absolute error varies between 0.02 and 0.25, and the kappa statistic varies
between 0.64 and 0.97, indicating that the results are much better than chance. In line with
the results for HTML-Tutor, the fact that very similar results were obtained with different
methods and trials demonstrates the consistency of the prediction and of the attributes
used for prediction. The results for Datasets 1 and 3 are better than the ones for Datasets 2
and 4, suggesting that the two new meta-attributes bring a significant information gain.
The highest accuracy was obtained using instance-based classification with the IBk
algorithm on Dataset 3: 98.59%; the confusion matrix for this method is presented in Table
27.4. For the disengaged TP rate, the same method performs best on the same dataset: 0.98.
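The measures discussed here can be read directly off a confusion matrix laid out as in Table 27.4 (taking disengaged as the positive class); note that figures computed from a single matrix may differ slightly from the averages over the ten repetitions of cross-validation reported in Table 27.3.

```python
# Accuracy, TP rate (recall for the disengaged class), and precision from the
# counts of a confusion matrix in the layout of Table 27.4.
def disengaged_metrics(tn, fp, fn, tp):
    # tn/fp: actual engaged predicted engaged/disengaged;
    # fn/tp: actual disengaged predicted engaged/disengaged.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tp_rate = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, tp_rate, precision

print(disengaged_metrics(tn=180, fp=1, fn=4, tp=155))
# -> approximately (0.985, 0.975, 0.994)
```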
Investigating further the information gain brought by the two meta-attributes, attribute
ranking using an information gain ranking filter as attribute evaluator was performed,
and the following ranking was found: NoPgP, AvgTimeP, NoPages, NoPgM, NoQuestions,
and AvgTimeQ. Hence, the meta-attributes seem to be more important than the attributes
related to quizzes. The information gain contributed by NoPgP is also reflected in the
decision tree displayed in Figure 27.1, where NoPgP has the highest information gain,
being the root of the tree.
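A rough equivalent of this ranking step, using scikit-learn's mutual information estimator in place of an information gain ranking filter (the two measures are related but not identical), could look as follows; the "engagement" target column name is a hypothetical assumption.

```python
# Attribute ranking by estimated mutual information with the engagement label,
# as a stand-in for an information-gain ranking filter. Column names follow
# Table 27.1; the "engagement" target column is a hypothetical name.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_attributes(df: pd.DataFrame, target: str = "engagement"):
    X = df.drop(columns=[target])
    y = df[target]
    scores = mutual_info_classif(X, y, random_state=0)
    return sorted(zip(X.columns, scores), key=lambda pair: -pair[1])
```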
[Tree diagram not reproduced: the root splits on NoPgP, with further splits on NoPgM, NoPages, and AvgTimeQ; leaves are labeled e (engaged) or d (disengaged).]
Figure 27.1
Decision tree graph for Dataset 3.
Table 27.5
Experiment Results Summary for HTML-Tutor
BN LR SL IBk ASC B CvR DT
Accuracy 87.07 86.52 87.33 85.62 87.24 87.41 87.64 86.58
TP rate 0.93 0.93 0.93 0.92 0.93 0.93 0.92 0.93
Precision 0.91 0.90 0.90 0.91 0.92 0.92 0.92 0.91
Error 0.10 0.12 0.12 0.10 0.10 0.12 0.12 0.11
Table 27.6
Similarities and Dissimilarities between iHelp and HTML-Tutor
Prediction based on reading and test attributes
iHelp: 81%–85% with no information on correctness/incorrectness of quizzes and no additional attributes; 85%–98% with the two additional attributes
HTML-Tutor: 86%–87%
Attribute ranking
iHelp: number of pages above a threshold, average time spent reading, number of pages read/accessed, number of pages below a threshold, number of questions from quizzes, average time spent on quizzes
HTML-Tutor: average time spent on pages, number of pages, number of tests, average time spent on tests, number of correctly answered tests, number of incorrectly answered tests
Even with the mentioned differences, the fact that a good level of prediction was obtained
from similar attributes on datasets from different systems, using the same methods,
indicates that engagement prediction is possible using information related to reading
pages and problem-solving activities, information logged by most e-Learning systems.
Therefore, our proposed approach for engagement prediction is potentially system
independent and could be generalized to any web-based system that includes both types
of activities.
relevant attributes and methods, while the second one involves the practical, implementa-
tion issues. For example, when developing an approach, the use of several methods serves
the purpose of inspecting the consistency of results, while in practice it is best to work with
one method.
27.5 Conclusions
In this chapter, issues related to validation in EDM were presented and discussed in the
context of a case study on disengagement detection. The proposed approach for disen-
gagement detection is simple and needs information about actions related to reading and
problem-solving activities, which are logged by most e-Learning systems. Because of these
characteristics, we believe that this approach can be generalized to other systems, as illus-
trated in the validation study presented in this chapter. The similarity of results across
different data mining methods is also an indicator of the consistency of our approach and
of the attributes used.
28
Lessons from Project LISTEN’s Session Browser
Contents
28.1 Introduction
28.1.1 Relation to Prior Research
28.1.2 Guidelines for Logging Tutorial Interactions
28.1.2.1 Log Tutor Data Directly to a Database
28.1.2.2 Design Databases to Support Aggregation across Sites
28.1.2.3 Log Each School Year's Data to a Different Database
28.1.2.4 Include Computer, Student ID, and Start Time as Standard Fields
28.1.2.5 Log End Time as well as Start Time
28.1.2.6 Name Standard Fields Consistently Within and Across Databases
28.1.2.7 Use a Separate Table for Each Type of Tutorial Event
28.1.2.8 Index Event Tables by Computer, Student ID, and Start Time
28.1.2.9 Include a Field for the Parent Event Start Time
28.1.2.10 Logging the Nonoccurrence of an Event Is Tricky
28.1.3 Requirements for Browsing Tutorial Interactions
28.2 Specify a Phenomenon to Explore
28.2.1 Specify Events by When They Occurred
28.2.2 Specify Events by a Database Query
28.2.3 Specify Events by Their Similarity to Another Event
28.3 Display Selected Events with the Context in Which They Occurred, in Adjustable Detail
28.3.1 Temporal Relations among Events
28.3.1.1 The Ancestors of a Descendant Constitute Its Context
28.3.1.2 Parents, Children, and Equals
28.3.1.3 Siblings
28.3.1.4 Duration and Hiatus
28.3.1.5 Overlapping Events
28.3.2 Displaying the Event Tree
28.3.2.1 Computing the Event Tree
28.3.2.2 Expanding the Event Tree
28.4 Summarize Events in Human-Understandable Form
28.4.1 Temporal Information
28.4.2 Event Summaries
28.4.3 Audio Recordings and Transcription
28.4.4 Annotations
28.1 Introduction
Intelligent tutoring systems’ ability to log their interactions with students poses both an
opportunity and a challenge. Compared to human observation of live or videotaped tutor-
ing, such logs can be more extensive in the number of students, more comprehensive in the
number of sessions, and more exquisite in the level of detail. They avoid observer effects,
cost less, and are easier to analyze. The resulting data is a potential gold mine [4]—but min-
ing it requires the right tools to locate promising areas, obtain samples, and analyze them.
For example, consider the data logged by Project LISTEN’s Reading Tutor [5,6], which
listens to children read aloud, and helps them learn to read [7–10]. As Figures 28.1 and
28.2 illustrate, each session involves taking turns with the Reading Tutor to pick a story
to read with its assistance. The Reading Tutor displays the story one sentence at a time,
and records the child’s utterances for each sentence. The Reading Tutor logs each event
(session, story, sentence, utterance, …) into a database table for that event type. Data from
tutors at different schools flow overnight into an aggregated database on our server. For
example, our 2003–2004 database includes 54,138 sessions, 162,031 story readings, 1,634,660
sentences, 3,555,487 utterances, and 10,575,571 words. This data is potentially very infor-
mative, but orders of magnitude larger than is feasible to peruse in its entirety.
We view educational data mining as an iterative cycle of hypothesis formation, test-
ing, and refinement. This cycle includes two complementary types of activities. One type
FIGURE 28.1
Picking a story in Project LISTEN’s Reading Tutor.
FIGURE 28.2
Assisted reading in Project LISTEN’s Reading Tutor.
and incorrect answers to individual test items, or (if available) what they wrote on the test
paper in the course of working them out. For data logged by an intelligent tutoring sys-
tem detailing its interactions with students, such analysis might try to identify significant
sequences of tutorial events. In both cases, the research question addressed is descriptive
[17, p. 204]: “What happened when …?” Such case analyses serve several purposes; a few
examples are listed below:
• Spot-check tutoring sessions to discover undesirable tutor–student interactions.
• Identify and characterize typical cases in which a specified phenomenon occurs.
• Develop a query by refining it based on examples it retrieves.
• Formulate hypotheses by identifying features that examples suggest are relevant.
• Sanity-check a hypothesis by checking that it covers the intended sorts of examples.
TABLE 28.1
Activities in a Session
TABLE 28.2
Sentence Encounters in a Story [18]
shown in Table 28.2. This ability to generate a table on demand with a single click spared
the user the effort of writing the multiple database queries, including joins, required to
generate the table.
However, the viewer suffered from several limitations. It displayed a list of records as a
standard HTML table, which was not necessarily human-understandable. Although tables
can be useful for comparing events of the same type, they are ill-suited to conveying the
heterogeneous set of events that transpired during a given interaction, or the context in
which they occurred. Navigation was restricted to drilling down from the top-level list
of students or tutors, with no convenient way to specify a particular type of interaction
to explore, and no visible indication of context. Finally, the viewer was inflexible. It was
specific not just to the Reading Tutor but to one particular version of it. The user could not
specify which events to select, how to summarize them, or (other than by deciding how far
to drill down) which details to include.
facilitates running a query on one school year’s data at a time, for example to develop and
refine a hypothesis using data from 1 year, and then test it on data from other years.
Also, each team member has his or her own database to modify freely without fear of
altering archival data. Making it easy for researchers to create new tables and views that
are readily accessible to each other is key. This step enables best practices to propagate
quickly, whereas version skew can become a problem if researchers keep private copies of
data on their own computers.
28.1.2.4 Include Computer, Student ID, and Start Time as Standard Fields
The key insight here is that student, computer, and time typically suffice to identify a
unique tutorial interaction of a given type. Together they distinguish the interaction from
those of another type, computer, or student. (We include computer name in case the stu-
dent ID is not unique, and also because some events, such as launching the Reading Tutor,
do not involve a student ID.) There are two reasons this idea is powerful. First, these fields
serve as a primary key for every table in the database, simplifying access and shortening
the learning curve for working with the data. Second, nearly every tutor makes use of the
concepts of students, computers, and time, so this recommendation is broadly applicable.
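For concreteness, a hypothetical event table following this guideline might be declared as below; the table name and field types are illustrative assumptions rather than the Reading Tutor's actual schema, though the field names echo the naming convention described later in this chapter.

```sql
-- Hypothetical sketch: an event table whose primary key is the combination
-- of computer, student ID, and start time, as the guideline recommends.
CREATE TABLE utterance (
  machine_name VARCHAR(64) NOT NULL,  -- computer that logged the event
  user_id      VARCHAR(64) NOT NULL,  -- student ID
  start_time   DATETIME    NOT NULL,  -- when the event began
  end_time     DATETIME,              -- when the event ended (Guideline 28.1.2.5)
  PRIMARY KEY (machine_name, user_id, start_time)
);
```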
28.1.2.8 Index Event Tables by Computer, Student ID, and Start Time
Database indices enable fast retrieval even from tables of millions of events. For example,
given computer name, student ID, and start time, retrieval from the 2003–2004 table of
10,765,345 word hypotheses output by the speech recognizer takes only 15 ms—3000 times
faster than the 45.860 s it would take to scan the table without using indices.
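In MySQL, such an index takes a single statement to create; the table and index names below are illustrative assumptions.

```sql
-- Hypothetical sketch: a composite index on computer, student ID, and start
-- time, enabling fast lookups like the one timed above.
CREATE INDEX idx_computer_student_start
  ON word_hypothesis (machine_name, user_id, start_time);
```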
the precise time at which the student did not read a skipped word is undefined, so the
Reading Tutor logs its start and end time as null. Fortunately, the Reading Tutor also logs
the skipped word’s sentence_encounter_start_time, and the position in the text sentence
where it should have been read. These two pieces of information about the skipped word
are non-null, enabling analysis queries and the Session Browser to retrieve words as
ordered in the sentence, whether or not the student read them.
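A retrieval query in this spirit might look as follows; sentence_encounter_start_time comes from the description above, while the table name and the remaining field names are illustrative assumptions.

```sql
-- Hypothetical sketch: retrieve the words of one sentence encounter in text
-- order, including skipped words whose own start_time was logged as NULL.
SELECT word_text, start_time, end_time
FROM word_encounter
WHERE machine_name = @machine
  AND user_id = @student
  AND sentence_encounter_start_time = @sentence_start
ORDER BY position_in_sentence;
```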
FIGURE 28.3
Event form.
Using this feature requires knowing the precise time when the event of interest occurred.
We knew the computer and user_id that generated the screenshot in Figure 28.2, but did
not know exactly when. What to do?
FIGURE 28.4
Event query.
FIGURE 28.5
Table of events returned by query.
To translate the result of a query into a set of tutorial events, the Session Browser scans
the labels returned as part of the query, and finds the columns for student, computer,
start time, and end time. As recommended in Guideline 28.1.2.6 above, the Session Browser
assumes standard names for these columns, e.g., “user_id” for student, “machine_name”
for computer, and “start_time” for start time. If necessary, the researcher can apply this
naming convention after the fact, e.g., by inserting “as start_time” in the query in order to
relabel a column named differently.
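For example, a researcher could relabel a nonstandard column on the fly; the table name below is a hypothetical stand-in, while launch_time echoes the Launch_time field mentioned later in this chapter.

```sql
-- Hypothetical sketch: relabel a differently named column so the Session
-- Browser recognizes it as the standard start_time field.
SELECT machine_name, user_id, launch_time AS start_time
FROM launch;
```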
An extensively used tutor collects too much data to inspect all of it by hand, so the first
step in mining it is to select a sample. For instance, this query selects a random sample of
10 from the table of student utterances:
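A minimal MySQL sketch of such a query, assuming a table named utterance with the standard fields described in Section 28.1.2, is:

```sql
-- Hypothetical sketch (not necessarily the Session Browser's actual query):
-- pick 10 utterances uniformly at random.
SELECT machine_name, user_id, start_time, end_time
FROM utterance
ORDER BY RAND()
LIMIT 10;
```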
Whether the task is to spot-check for bugs, identify common cases, formulate hypotheses,
or check their sanity, our mantra is “check (at least) ten random examples.” Random selec-
tion assures variety and avoids the sample bias of, e.g., picking the first 10 examples in
the database. For example, random examples quickly revealed Session Browser bugs not
manifest when using the standard test examples used previously.
Our queries typically focus on a particular phenomenon of interest, such as the set of
questions that students took longest to answer, or steps where they got stuck long enough
for the Reading Tutor to prompt them. Exploring examples of such phenomena can help
the researcher spot common features and formulate causal hypotheses to test with statisti-
cal methods on aggregated data.
For example, one such phenomenon was a particular student behavior—namely, click-
ing Back out of stories. The Reading Tutor has Go and Back buttons to navigate forward or
backward in a story. In the 2004–2005 version, the Go button advanced to the next sentence,
and the Back button returned to the preceding sentence. We had previously observed that
students sometimes backed out of a story by clicking Back repeatedly even after they had
invested considerable time in the story. We were interested in understanding what might
precipitate this presumably undesirable behavior. This query finds a random sample of 10
stories that students backed out of after more than 60 s:
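A sketch of such a query, with story_encounter and exit_via as purely hypothetical table and field names (the Reading Tutor's actual schema may differ), is:

```sql
-- Hypothetical sketch: 10 random story encounters lasting more than 60
-- seconds that ended with the student clicking Back out of the story.
SELECT machine_name, user_id, start_time, end_time
FROM story_encounter
WHERE exit_via = 'back'
  AND TIMESTAMPDIFF(SECOND, start_time, end_time) > 60
ORDER BY RAND()
LIMIT 10;
```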
FIGURE 28.6
Specifying events by similarity to another event.
FIGURE 28.7
Attribute-value list for an event record.
by exploiting the natural hierarchical structure of nested time intervals, which Section
28.3.1 now defines.
28.3.1.3 Siblings
• A younger sibling has the same parent(s) as its older sibling(s) but starts later.
• An eldest child has no older siblings.
FIGURE 28.8
Time intervals with various temporal relations to a focal event: ancestors, parent, older and younger siblings, equals, eldest child, children, and descendants.
• The hiatus between a parent and an eldest child is the difference between their
start times.
• The hiatus between successive siblings is the difference between the older sib-
ling’s end time and the younger sibling’s start time.
FIGURE 28.9
Event tree shows context of the highlighted event.
FIGURE 28.10
Hierarchical context and partially expanded details of (another) highlighted event.
FIGURE 28.11
What occurred before backing out of a story?
beforehand lasted 14–39 s reflects very slow reading, averaging a few seconds per word.
The fact that the subsequent hiatuses from when the Reading Tutor displayed a sentence
to when the student clicked Back were less than 100 ms long indicates that the student was
clicking repeatedly as fast as possible.
28.4.4 Annotations
Right-clicking on an event and selecting Annotations brings up a table listing its current
annotations, as in Figure 28.13. Clicking on an annotation lets the user edit or hide its
value. Clicking on the blank bottom row of the Annotate a Record table lets the user add a
new attribute.
The annotation feature is relatively recent (August 2008). So far we have used it primar-
ily for two published analyses. One analysis examined students’ free-form answers to
comprehension questions [25]. The annotations specified how an expert teacher would
respond, what features of the answer called for this response, and what purpose it was
intended to achieve. The purpose of the analysis was to identify useful features of students’
FIGURE 28.12
Audio transcription window.
FIGURE 28.13
Annotations of an event.
free-form answers to try to detect automatically. In the other analysis [26], two human
judges hand-rated children’s recorded oral reading utterances on a four-dimensional flu-
ency rubric, with one attribute for each dimension. The purpose of this analysis was to
provide a human “gold standard” against which to evaluate automated measurements
of oral reading fluency. The reason for having a second judge was to compute inter-rater
reliability, so it was important for the two judges’ ratings to be independent. We used the
ability to hide attributes to conceal the first judge’s ratings, and added a parallel set of four
attributes for the second judge to rate the same utterances.
We implement annotations by representing them as pseudo-events in order to use the
same code that retrieves regular events, integrates them into event trees, and displays
them. Therefore, the Session Browser stores the attribute, value, author, and date of an
annotation as a record in a global table of annotations, along with the annotated event’s
computer, student, and time interval. An event and its annotations are equals as defined
in Section 28.3.1.2, so they appear at the same depth in the event tree.
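A global annotation table of the kind described might be declared roughly as follows; the field names and types are assumptions for illustration.

```sql
-- Hypothetical sketch of the global annotation table: each annotation stores
-- its attribute, value, author, and date, plus the annotated event's
-- computer, student, and time interval.
CREATE TABLE annotation (
  machine_name VARCHAR(64) NOT NULL,
  user_id      VARCHAR(64) NOT NULL,
  start_time   DATETIME    NOT NULL,
  end_time     DATETIME,
  attribute    VARCHAR(64) NOT NULL,
  value        TEXT,
  author       VARCHAR(64) NOT NULL,
  date_added   DATETIME,
  hidden       TINYINT(1)  DEFAULT 0,  -- supports hiding a value, e.g., one judge's ratings
  PRIMARY KEY (machine_name, user_id, start_time, attribute, author)
);
```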
Prior to implementing the annotation mechanism, we wrote special-purpose queries
to collect the information we thought the annotator would need, and exported the result-
ing table into an Excel spreadsheet for the annotator to fill out. For example, the Reading
Tutor recorded students’ free-form answers to comprehension questions [25]. We sent our
expert reading teacher the transcribed answers in a spreadsheet with columns to fill in to
describe how she would respond to each answer, under what circumstances, and for what
purpose. We included columns for the context we thought she would need, such as which
prompt the student was answering.
The annotation mechanism offers a number of advantages over the spreadsheet
approach. First, it eliminates the need to write special-purpose queries (typically com-
plex joins) to assemble the context needed by the annotator, because the Session Browser
already displays event context. Second, it avoids the need to anticipate which context the
annotator will require, because the annotator can use the Session Browser to explore addi-
tional context if need be. Third, storing annotations in the database instead of in a spread-
sheet supports the ongoing arrival and transcription of new data. Sending spreadsheets
back and forth is a cumbersome batch solution. In contrast, storing annotations directly in
the database makes it easier to keep up with new data. A simple query enables the annota-
tor to enumerate events not yet annotated.
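Under the hypothetical schema sketched above, such a query might take the following form.

```sql
-- Hypothetical sketch: utterances with no annotation yet, found by joining
-- on the shared computer, student, and start-time key.
SELECT u.machine_name, u.user_id, u.start_time
FROM utterance u
LEFT JOIN annotation a
  ON  a.machine_name = u.machine_name
  AND a.user_id      = u.user_id
  AND a.start_time   = u.start_time
WHERE a.attribute IS NULL;
```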
along with the results of the query. By following the data-logging guidelines in Section
28.1.2, we exploit the assumption that the names and meanings of fields are mostly consis-
tent across database tables and over time. Thus the code assumes particular field names
for student, machine, and start and end times, but overrides this convention for excep-
tions. For example, the normal method to extract the start time of an event looks for a field
named Start_time, but is overridden for particular tables that happen to call it something
else, such as Time for instantaneous types of events, or Launch_time for the table that logs
each launch of the Reading Tutor.
As explained in Section 28.3.2, the method to compute the context of a selected target
event is as follows: First, extract its student, computer, and start time. Then query every
table of the database for events that involve the same student and computer and contain
the start of the target event. Finally, sort the retrieved records according to the ancestor
relation, and display them accordingly by inserting them in the appropriate positions in
the expandable tree widget. The method to find the children of a given event fires only
when needed to expand the event node. It finds descendants in much the same way as the
method to find ancestors, but then winnows them down to the children (those that are not
descendants of others).
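A hypothetical version of the per-table ancestor query (issued once against each event table) might read as follows; the table name is an illustrative stand-in.

```sql
-- Hypothetical sketch: candidate ancestors of a target event are events by
-- the same student on the same computer whose time interval contains the
-- target event's start time.
SELECT *
FROM story_encounter
WHERE user_id      = @target_user
  AND machine_name = @target_machine
  AND start_time  <= @target_start
  AND end_time    >= @target_start;
```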
Rather than query every table to find ancestors and descendants of a given event, a
more knowledge-based method would know which types of Reading Tutor events can be
parents of which others. However, this knowledge would be tutor- and, possibly, version-
specific. In contrast, our brute force solution of querying all tables requires no such knowl-
edge. Moreover, its extra computation is not a problem in practice. Our databases consist of
a few dozen tables, the largest of which have tens of millions of records. Despite this table
size, the Session Browser typically computes the context of an event with little or no delay.
FIGURE 28.14
Select database and tables.
The databases menu shows the currently selected database(s). We have a separate
database for archival data from each school year. In addition, each member of our
research team has a personal database so as to protect archival data when using it
to build and modify new tables, which may augment archival tables with additional
derived information. Selecting a personal database in addition to an archival database
causes the Session Browser to include it as well when searching for an event’s ancestors
or descendants.
The tables menu lists the tables in the most recently selected database. We assume a
database has a different table for each type of event. The checkbox for each table in the
selected database specifies whether to include that type of event in the event tree. For
example, turning on the audio_output table tells the Session Browser to include the Reading
Tutor’s own logged speech output in the event tree, which is an easy way to view the
tutor’s side of the tutorial dialogue. The ability to select which tables to include in the event
tree helps focus on the types of events relevant to the particular task at hand by excluding
irrelevant details.
The checkboxes do not distinguish among events of the same type. For instance, a user
might want the event tree to include the tutor’s spoken tutorial assistance (e.g., “rhymes with
peach”) but not its backchanneling (e.g., “mmm”). User-programmable filters would allow
such finer-grained distinctions, but would not be as simple as using check boxes to specify which
event types to include. So far, the check boxes have afforded us sufficient control.
Mousing over a table in the tables menu displays its list of fields in the fields menu.
We use this feature primarily as a handy reminder of what fields are in a given table.
However, we can also uncheck the box next to a field to omit that field when displaying
the attribute-value list representation for an event record. For example, the audio_record-
ing table includes a field for the audio input logged by the tutor. The value of this field can
occupy enough memory to cause a noticeable lag in retrieving it from the database server.
Unchecking the field to omit it from the display avoids this lag. In fact, unlike other fields,
the default setting for this particular field is to leave it unchecked.
We added the construct @this.field@ to refer to a field of the event being summarized.
The Session Browser replaces this construct with regular MySQL before executing the
query. Combined with the generic mechanism for event duration and hiatus, this query
produces summary lines such as “34 second(s) long: Session 2008-10-01 14:31:46.”
Other queries are more complex. For example, the summary line for an utterance shows
text words the speech recognizer accepted in UPPER CASE, substitutions in lower case,
and omitted words as –. (The omission of a word constitutes the nonoccurrence of an event
as discussed in Section 28.1.2.10.) The query to summarize an utterance is more involved; a rough sketch of its shape follows.
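Every table and field name in this sketch is a hypothetical stand-in (the Session Browser's actual query is more detailed), and in the Session Browser the three WHERE values would be written with the @this.field@ construct introduced above.

```sql
-- Hypothetical sketch only: render each aligned word as UPPER CASE if the
-- recognizer accepted it, lower case if it heard a substitution, or '-' if
-- the word was omitted, concatenated in sentence order.
SELECT GROUP_CONCAT(
         CASE
           WHEN accepted = 1        THEN UPPER(target_word)
           WHEN heard_word IS NULL  THEN '-'
           ELSE LOWER(heard_word)
         END
         ORDER BY position_in_sentence SEPARATOR ' ') AS summary
FROM word_alignment
WHERE machine_name = @target_machine
  AND user_id      = @target_user
  AND start_time   = @target_start;
```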
For example, a summary of one utterance for the sentence “Don’t ya’ forget to wear
green on St Patrick’s Day!” is
no audio: 4 second(s) earlier, 4 second(s) long: DON’T YA start_forget(2f_axr) TO WEAR
––––––
Here “start_forget(2f_axr)” is a phonetic truncation recognized instead of “forget.”
28.6 Conclusion
We conclude by summarizing research contributions, acknowledging some limitations,
and evaluating the Session Browser.
28.6.2 Evaluation
Relevant criteria for evaluating the Session Browser include implementation cost, effi-
ciency, generality, usability, and utility.
28.6.2.2 Efficiency
Running both the database server and the Session Browser on ordinary PCs, we routinely
explore databases with data from hundreds of students, thousands of hours of tutorial
interaction, and millions of words. The operations reported here often update the display
with no perceptible lag, though a complex query to find a specified set of events may take
several seconds or more.
28.6.2.3 Generality
Structural evidence of the Session Browser’s generality includes its largely tutor-indepen-
dent design, reflected in brevity of code (103 .java files totaling 799 kB, with median size
4 kB) and relative scarcity of references to specific tables or fields of the database. Both
these measures would improve by deleting unused code left over from the initial tutor-
specific implementation. Empirical evidence of generality includes successful use of the
Session Browser with archival databases from different years’ versions of the Reading
Tutor as well as with derivative databases constructed by members of Project LISTEN.
Other researchers have not as yet adapted the Session Browser to databases from tutors
besides the Reading Tutor, in part because most tutors still log to files, not databases.
However, it may be feasible to use the Session Browser with data in PSLC’s DataShop
[16] logged by various cognitive tutors. These tutors typically log in XML, which a PSLC
translation routine parses and imports into a database. Datashop users don’t even need to
know the structure of the database; they just need to output or convert log data into PSLC’s
XML format, and to be sure that the log details are sufficient.
The PSLC Datashop database has a logically and temporally hierarchical event structure
(imputed by its XML translator), which generally obeys the semantic properties required
by the Session Browser. Tutorial interaction could lack such structure if it consisted of tem-
porally interleaved but logically unrelated events, or were based primarily on other sorts
of relations, such as the anaphoric relation between antecedent and referent in natural lan-
guage dialog. However, tutorial data from the PSLC Datashop appears to have hierarchical
temporal structure.
We are currently investigating the feasibility of applying the Session Browser to that
data by modifying it and/or the PSLC database to use compatible representations, e.g.,
one table per event type, with fields for student, machine, and event and start–end times,
and conventions for naming them. If so, the Session Browser could apply to many more
tutors—at least if their interactions have the largely hierarchical temporal structure that it
exploits, displays, and manipulates.
28.6.2.4 Usability
The Session Browser is now used regularly by several members of Project LISTEN after at
most a few minutes of instruction. Usability is hard to quantify. However, we can claim
a 10- or 100-fold reduction in keystrokes compared to obtaining the same information
by querying the database directly. For example, clicking on events in the list of query
results displays their context as a chain of ancestor events. Identifying these ancestors by
querying the database directly would require querying a separate table for each ancestor.
Moreover, the Session Browser’s graphical user interface enables users to explore event
context without knowing how to write queries, and stored configurations make it easy for
them to retrieve events to transcribe or annotate.
28.6.2.5 Utility
The ultimate test of the Session Browser is whether it leads to useful discoveries, or at
least sufficiently facilitates the process of educational data mining that researchers find
it helpful and keep using it. We cannot as yet attribute a publishable scientific discovery
to a eureka moment in the Session Browser, nor do we necessarily expect one, because
scientific discovery tends to be a gradual, multi-step process rather than a single flash of
insight. However, the Session Browser has served as a useful tool in some of our published
research. Time will clarify its value for us and, if it is extended to work with Datashop-
compatible tutors, for other researchers as well.
Acknowledgments
This work was supported in part by the National Science Foundation under ITR/IERI
Grant No. REC-0326153 and in part by the Institute of Education Sciences, U.S. Department
of Education, through Grants R305B070458, R305A080157, and R305A080628 to Carnegie
Mellon University. Any opinions, findings, conclusions, or recommendations expressed
in this publication are those of the authors and do not necessarily reflect the views of
the Institute, the U.S. Department of Education, or the National Science Foundation, or
the official policies, either expressed or implied, of the sponsors or of the United States
Government. We thank the educators and students who generated our data, the members
of Project LISTEN who use the Session Browser, the contributors to the 2005 papers on
which much of this chapter is based [1–3], Thomas Harris for exploring the feasibility
of using the Session Browser with data from PSLC’s DataShop, and Ryan Baker and the
reviewers for helpful comments on earlier drafts.
References
(Project LISTEN publications are available at www.cs.cmu.edu/~listen)
1. Mostow, J., Beck, J., Cen, H., Cuneo, A., Gouvea, E., and Heiner, C., An educational data min-
ing tool to browse tutor-student interactions: Time will tell!, in Proceedings of the Workshop on
Educational Data Mining, National Conference on Artificial Intelligence, Beck, J. E. (Ed.), AAAI
Press, Pittsburgh, PA, 2005, pp. 15–22.
2. Mostow, J., Beck, J., Cen, H., Gouvea, E., and Heiner, C., Interactive demonstration of a generic
tool to browse tutor-student interactions, in Interactive Events Proceedings of the 12th International
Conference on Artificial Intelligence in Education (AIED 2005), Amsterdam, the Netherlands, 2005,
pp. 29–32.
3. Mostow, J., Beck, J., Cuneo, A., Gouvea, E., and Heiner, C., A generic tool to browse tutor-
student interactions: Time will tell!, in Proceedings of the 12th International Conference on Artificial
Intelligence in Education (AIED 2005), Amsterdam, the Netherlands, 2005, pp. 884–886.
4. Beck, J. E., Proceedings of the ITS2004 Workshop on Analyzing Student-Tutor Interaction Logs to
Improve Educational Outcomes, Maceio, Brazil, August 30–September 3, 2004.
5. Mostow, J. and Aist, G., Evaluating tutors that listen: An overview of Project LISTEN, in Smart
Machines in Education, Forbus, K. and Feltovich, P. (Eds.), MIT/AAAI Press, Menlo Park, CA,
2001, pp. 169–234.
6. Mostow, J. and Beck, J., When the rubber meets the road: Lessons from the in-school adven-
tures of an automated Reading Tutor that listens, in Scale-Up in Education, Schneider, B. and
McDonald, S.-K. (Eds.), Rowman & Littlefield Publishers, Lanham, MD, 2007, pp. 183–200.
7. Mostow, J., Aist, G., Burkhead, P., Corbett, A., Cuneo, A., Eitelman, S., Huang, C., Junker, B.,
Platz, C., Sklar, M. B., and Tobin, B., A controlled evaluation of computer- versus human-
assisted oral reading, in Artificial Intelligence in Education: AI-ED in the Wired and Wireless Future,
Moore, J. D., Redfield, C. L., and Johnson, W. L. (Eds.), IOS Press, San Antonio, TX, Amsterdam,
the Netherlands, 2001, pp. 586–588.
8. Mostow, J., Aist, G., Huang, C., Junker, B., Kennedy, R., Lan, H., Latimer, D., O’Connor, R., Tassone,
R., Tobin, B., and Wierman, A., 4-Month evaluation of a learner-controlled Reading Tutor that lis-
tens, in The Path of Speech Technologies in Computer Assisted Language Learning: From Research Toward
Practice, Holland, V. M. and Fisher, F. P. (Eds.), Routledge, New York, 2008, pp. 201–219.
9. Poulsen, R., Wiemer-Hastings, P., and Allbritton, D., Tutoring bilingual students with an auto-
mated Reading Tutor that listens, Journal of Educational Computing Research 36(2), 191–221, 2007.
10. Mostow, J., Aist, G., Bey, J., Burkhead, P., Cuneo, A., Junker, B., Rossbach, S., Tobin, B., Valeri, J.,
and Wilson, S., Independent practice versus computer-guided oral reading: Equal-time compari-
son of sustained silent reading to an automated reading tutor that listens, in Ninth Annual Meeting
of the Society for the Scientific Study of Reading, Williams, J. (Ed.), Chicago, IL, June 27–30, 2002.
11. Aist, G., Towards automatic glossarization: Automatically constructing and administering
vocabulary assistance factoids and multiple-choice assessment, International Journal of Artificial
Intelligence in Education 12, 212–231, 2001.
12. Mostow, J., Beck, J., Bey, J., Cuneo, A., Sison, J., Tobin, B., and Valeri, J., Using automated
questions to assess reading comprehension, vocabulary, and effects of tutorial interventions,
Technology, Instruction, Cognition and Learning 2, 97–134, 2004.
13. Mostow, J., Beck, J. E., and Heiner, C., Which help helps? Effects of various types of help on
word learning in an automated Reading Tutor that listens, in Eleventh Annual Meeting of the
Society for the Scientific Study of Reading, Reitsma, P. (Ed.), Amsterdam, the Netherlands, 2004.
14. Corbett, A. and Anderson, J., Knowledge tracing: Modeling the acquisition of procedural
knowledge, User Modeling and User-Adapted Interaction, 4, 253–278, 1995.
15. Beck, J. E. and Mostow, J., How who should practice: Using learning decomposition to evaluate the
efficacy of different types of practice for different types of students, in Ninth International Conference
on Intelligent Tutoring Systems, Montreal, Canada, 2008, pp. 353–362. Nominated for Best Paper.
16. Koedinger, K. R., Baker, R. S. J. d., Cunningham, K., Skogsholm, A., Leber, B., and Stamper, J., A
data repository for the EDM community: The PSLC DataShop, in Handbook of Educational Data
Mining, Romero, C., Ventura, S., Pechenizkiy, M., and Baker, R. S. J. d. (Eds.), CRC Press, Boca
Raton, FL, 2010, pp. 43–56.
17. Shavelson, R. J. and Towne, L., Scientific Research in Education, National Academy Press, National
Research Council, Washington, DC, 2002.
18. Mostow, J., Beck, J., Chalasani, R., Cuneo, A., and Jia, P., Viewing and analyzing multi-
modal human-computer tutorial dialogue: A database approach, in Proceedings of the Fourth
IEEE International Conference on Multimodal Interfaces (ICMI 2002) IEEE, Pittsburgh, PA, 2002,
pp. 129–134. First presented June 4, 2002, at the ITS 2002 Workshop on Empirical Methods for
Tutorial Dialogue Systems, San Sebastian, Spain.
19. Mostow, J. and Beck, J. E., Why, what, and how to log? Lessons from LISTEN, in Proceedings of
the Second International Conference on Educational Data Mining, Córdoba, Spain, 2009, pp. 269–278.
20. Alpern, M., Minardo, K., O’Toole, M., Quinn, A., and Ritzie, S., Unpublished Group Project for
Masters’ Lab in Human-Computer Interaction, 2001.
21. Beck, J. E., Chang, K.-m., Mostow, J., and Corbett, A., Does Help help? Introducing the Bayesian
evaluation and assessment methodology, in Ninth International Conference on Intelligent Tutoring
Systems, Montreal, Canada, 2008, pp. 383–394. ITS2008 Best Paper Award.
22. Heiner, C., Beck, J. E., and Mostow, J., When do students interrupt help? Effects of time, help
type, and individual differences, in Proceedings of the 12th International Conference on Artificial
Intelligence in Education (AIED 2005), Looi, C.-K., McCalla, G., Bredeweg, B., and Breuker, J.
(Eds.), IOS Press, Amsterdam, the Netherlands, 2005, pp. 819–826.
23. Heiner, C., Beck, J. E., and Mostow, J., Improving the help selection policy in a Reading Tutor
that listens, in Proceedings of the InSTIL/ICALL Symposium on NLP and Speech Technologies in
Advanced Language Learning Systems, Venice, Italy, 2004, pp. 195–198.
24. MySQL, Online MySQL Documentation at http://dev.mysql.com/doc/mysql, 2004.
25. Zhang, X., Mostow, J., Duke, N. K., Trotochaud, C., Valeri, J., and Corbett, A., Mining free-
form spoken responses to Tutor prompts, in Proceedings of the First International Conference on
Educational Data Mining, Baker, R. S. J. d., Barnes, T., and Beck, J. E. (Eds.), Montreal, Canada,
2008, pp. 234–241.
26. Mostow, J. and Duong, M., Automated assessment of oral reading prosody, in Proceedings of
the 14th International Conference on Artificial Intelligence in Education (AIED2009), Dimitrova, V.,
Mizoguchi, R., Boulay, B. d., and Graesser, A. (Eds.), IOS Press, Brighton, U.K., 2009, pp. 189–196.
29
Using Fine-Grained Skill Models to Fit Student
Performance with Bayesian Networks
Contents
29.1 Introduction
29.1.1 Background on the MCAS Test
29.1.2 Background on the ASSISTment System
29.2 Models: Creation of the Fine-Grained Skill Model
29.2.1 How the Skill Mapping Was Used to Create a Bayesian Network
29.2.2 Model Prediction Procedure
29.3 Results
29.3.1 Internal/Online Data Prediction Results
29.4 Discussion and Conclusions
Acknowledgments
References
29.1 Introduction
The largest standardized tests (such as the SAT or GRE) are what psychometricians call
“unidimensional” in that they are analyzed as if all the questions are tapping a single
latent trait. However, cognitive scientists such as Anderson and Lebiere [1] believe that
students are learning individual skills and might learn one skill but not another. Among
the reasons psychometricians analyze large-scale tests in a unidimensional manner, as
our colleagues have done with item-response models [2,4], is that student performance
on different skills is usually highly correlated, even if there is no necessary prerequisite
relationship between these skills. We are engaged in an effort to investigate whether we can do
a better job of predicting student performance by modeling skills at a fine-grained level.
Four different skill models* are considered: one that is unidimensional, WPI-1; one that has
five skills, WPI-5; one that has 39 skills, WPI-39; and our most fine-grained model that has
106 skills, WPI-106. We will refer to a tagging of skills to questions as a skill model. We
will compare skill models that differ in the number of skills and use Bayesian networks
to see how well the different models fit a dataset of student responses collected via the
ASSISTment system.
* A skill model is referred to as a “Q-matrix” by some AI researchers [5] and psychometricians [19].
There are many researchers in user modeling and educational data mining communities
working with Intelligent Tutoring Systems who have adopted Bayesian network methods
for modeling knowledge [3,7,12,21]. Even methods not originally thought of as Bayesian
network methods turned out to be so; Reye [18] showed that the classic Corbett and
Anderson “Knowledge tracing” approach [8] was a special case of a dynamic belief net-
work. While we make use of knowledge tracing in our analysis, other notable work takes a
different approach. Wilson’s learning progressions [20], for example, tracks different levels
of knowledge within a concept map based on misconceptions detected according to the
student’s multiple-choice answer selection. A student’s progress is tracked longitudinally
across concept maps. This approach would perhaps be the strongest when there are clear
misconceptions to track. We instead follow in the tradition of Anderson and VanLehn–
style cognitive tutors where knowledge is treated as a binary random variable and where
the subject of skill granularity is most relevant.
We are not the first to do model selection based on how well the model fits real student
data [10,12]. Nor are we the only ones concerned with the question of granularity. Greer
and colleagues [11] have proposed methods of assessment using different levels of granu-
larity to conceptualize student knowledge. However, we are not aware of other work prior
to this* work where researchers attempted to empirically answer the question of “what is
the right level of granularity to best fit a dataset of student responses.”
* This chapter is an expansion of work presented at the workshop on Educational Data Mining held at the
8th International Conference on Intelligent Tutoring Systems (2006) in Taiwan [13] and the 11th International
Conference on User Modeling (2007) in Corfu, Greece [14].
Figure 29.1
An ASSISTment showing the original question and the first two scaffolds.
presented with scaffold questions that break the original problem into separate skills. On
average, students in our 2004–2005 dataset answered 100 original questions and 160 scaf-
fold questions. A student’s response was only marked as correct if he or she answered the
question correctly on the first attempt without assistance from the system. The MCAS items
from the 2005 test, publicly released in June of 2005, were tagged with skills shortly after
release but before we received any of the students’ official scores for the test from the state.
Figure 29.2
Questions are tagged with the WPI-106 that is mapped to the other skill models.
came up with a number of skills that were somewhat fine-grained but not so fine-grained
that each item had a different skill. Therefore, we imposed upon our subject-matter expert
the restriction that no one item would be tagged with more than three skills. This model
is referred to as the April Model or WPI-106, due to the fact that when she was done, there
were 106 knowledge components that that model attempted to track. We also wanted
to use coarser skill models so we borrowed skill names from The National Council of
Teachers of Mathematics and the Massachusetts Department of Education who use broad
classifications of 5- and 39-skill sets. The 5- and 39-skill classifications were not tagged to
the questions. Instead, the skills in the coarse-grained models were mapped to skills in the
finest-grained model in a “is a part of” type of hierarchy, as opposed to a prerequisite hier-
archy [6]. The appropriate question-skill tagging for the WPI-5 and WPI-39 models could
therefore be derived from the WPI-106, as illustrated in Figure 29.2. Items could be tagged
with up to three skills. The state’s own choice of skill tags for an item was not used because
their model only allows a question to be tagged with a single skill. Comparing a single-skill-per-question model to a multi-skill-per-question model would not be a fair comparison.
29.2.1 How the Skill Mapping Was Used to Create a Bayesian Network
Our Bayesian networks consisted of three layers of binomial random variable nodes, as
illustrated in Figure 29.3. A separate network was created for each skill model. The top
layer nodes represent knowledge of a skill that was set to a prior probability of 0.50. This
model is simple and assumes all skills are as equally likely to be known prior to being
given any evidence of student responses. Once we present the network with evidence,
it can quickly infer probabilities about what the student knows. The bottom layer nodes
are the question nodes with conditional probabilities set ad hoc to 0.10 for the probability
of answering correctly without knowing the skill, guess, and 0.05 for the probability of
answering incorrectly if the skill is known, slip. The intermediate second layer consists
of AND* gates that, in part, allowed us to only specify a guess and slip parameter for the
* An “ALL” gate is equivalent to a logical AND. Kevin Murphy’s Bayes Net Toolbox (BNT) evaluates MATLAB®’s
ALL function to represent the boolean node. This function takes a vector of values as opposed to only two
values if using the AND function. Since a question node may have three skills tagged to it, the ALL function
is used.
Figure 29.3
Example of the Bayesian network topology for an ASSISTment question with two scaffolds. P(q=True|Gate=False)
is the guess. P(q=False|Gate=True) is the slip.
question nodes regardless of how many skills were tagged to them. Our colleagues [2]
investigated using a compensatory model with the same dataset but we found [15] that a
conjunctive (AND) model is very well suited to modeling the composition of multiple skills. When
predicting MCAS test questions, a guess value of 0.25 was used to reflect the fact that the
MCAS items being predicted were all multiple-choice (one correct answer out of four),
while most of the online ASSISTment system items have text-input fields as the answer
type. Future research will explore learning the parameters from data.
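In equation form, restating the parameter settings just described (with G denoting the AND gate over the skills tagged to question q):

\[
P(q = \text{correct} \mid G = \text{false}) = \mathit{guess} = 0.10, \qquad
P(q = \text{correct} \mid G = \text{true}) = 1 - \mathit{slip} = 1 - 0.05 = 0.95,
\]

with guess raised to 0.25 when predicting the multiple-choice MCAS items.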
29.3 Results
An early version of the results in this section (using approximate inference instead of exact
inference and without Section 29.3.1) appears in a workshop paper [13]. The mean absolute
difference (MAD) is the average absolute difference between each student's predicted and
actual score. The under/over prediction is our predicted average score minus the actual
average score on the test.
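Stated compactly (this reading is consistent with the tables that follow, where the percentage error appears to equal the MAD divided by the number of items predicted, e.g., 3.73/29 ≈ 12.86% for the 29-item test):

\[
\mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{s}_i - s_i \right|, \qquad
\mathrm{Error} = \frac{\mathrm{MAD}}{\text{number of items predicted}},
\]

where $s_i$ and $\hat{s}_i$ are student $i$'s actual and predicted scores and $N$ is the number of students.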
Table 29.1
Tabular Illustration of Error Calculation
Test Question   Skill Tagging (WPI-5)       User 1 P(q)   User 2 P(q)   …   User 600 P(q)   Average Error
1               Patterns                    0.2           0.9           …   0.4
2               Patterns                    0.2           0.9           …   0.4
3               Patterns and measurement    0.1           0.5           …   0.2
4               Measurement                 0.8           0.8           …   0.3
5               Patterns                    0.2           0.9           …   0.4
:               :                           :             :                 :
29              Geometry                    0.7           0.7           …   0.2
Predicted score                             14.2          27.8          …   5.45
Actual score                                18            23            …   9
Error                                       10.34%        19.42%        …   12.24%          17.28%
Table 29.2
Model Prediction Performance Results
for the MCAS Test
Model Error MAD Under/Over
WPI-39 12.86% 3.73 ↓ 1.4
WPI-106 14.45% 4.19 ↓ 1.2
WPI-5 17.28% 5.01 ↓ 3.6
WPI-1 22.31% 6.47 ↓ 4.3
Note: Results were statistically significantly separable from each other at the p < 0.05 level.
The results in Table 29.2 show that the WPI-39 had the best accuracy with an error of
12.86% that translates to a raw score error of 3.73. The finest-grain model, the WPI-106, came
in second followed by the WPI-5 and finally the WPI-1. We can conclude that the fine-grain
models are best for predicting the external test but that the finest-grain model did not pro-
vide the best fit to the data. An analysis [16] of error residuals revealed that test questions
that were poorly predicted had a dramatically higher percentage of correctness on the test
than questions on the ASSISTment system relating to the same skill. These ASSISTment
system questions all had text-input (fill-in-the-blank) question types. The conclusion
drawn was that the multiple-choice question type of the test made some questions much
easier than their ASSISTment system counterparts to an extent not captured by the guess
and slip of the model. This disparity in difficulty likely contributed to the consistent under-prediction of performance on the test. Learning the guess and slip parameters of the tutor
and test questions can help correct for this variation in performance. All results in Table
29.2 were statistically significantly separable from each other at the p < 0.05 level.*
* We compared [16] this Bayesian method to a mixed-effects [9] test prediction that was run with the same data
and found that the best mixed-effects model came up 0.05% short of the Bayesian method’s best model. Both
approaches agreed that the two finest-grained models were most accurate. Internal fit was not run with the
mixed-effects approach.
Table 29.3
Model Prediction Performance
Results for Internal Fit
Model Error MAD Under/Over
WPI-106 5.50% 15.25 ↓ 12.31
WPI-39 9.56% 26.70 ↓ 20.14
WPI-5 17.04% 45.15 ↓ 31.60
WPI-1 26.86% 69.92 ↓ 42.17
Note: Results were statistically significantly separable from each other at the p < 0.05 level.
Figure 29.4
Class skill report generated by the WPI-106 fine-grained skill model.
error of the coarse-grained WPI-1 and WPI-5 models remained relatively steady between
predicting the test and responses within the tutor. We believe that the finer-grained mod-
els’ reduced performance on test prediction might be due to questions on the tutor with
a much higher difficulty than questions of the same skill on the test. The skill of ‘Venn-
Diagram’ from the WPI-106, for example, has two questions on the ASSISTment system,
both with fill-in-the-blank answer types and an average correctness of 18.2% on the system.
A similar ‘Venn-Diagram’ question appeared on the state standardized test in multiple-
choice form and recorded a correctness of 87.3% on the test. Coarser-grained models may
be less susceptible to this variance in difficulty because knowledge estimates are averaged
over a wider variety of questions. Thirty-five questions and eight WPI-106 skills, including
‘Venn-Diagram,’ are represented by the WPI-39 skill of ‘understanding-data-presentation-
techniques.’ Eighty-three questions and sixteen of the WPI-106 skills, including ‘Venn-
Diagram,’ are represented by the WPI-5 skill of ‘data-analysis-statistics-probability.’
Some of our colleagues have pursued item-response models for this very dataset [2,4] with
considerable success. We think that item-response models do not help teachers identify
what skills a student should work on, so even though they might be very good predictors
of student responses, they suffer in other ways. Part of the utility of fine-grained model-
ing is being able to identify skills that students have mastered and those they still need to
master. An example of a class skill report presented to a teacher is shown in Figure 29.4.
We think that this work is important in that while adapting fine-grained models is hard,
we have shown that fine-grained modeling can produce a highly accurate prediction of
student performance. Future work on this topic should include learning guess and slip
parameters that provide a better fit to the data. Also worth investigating would be using
a temporal Bayesian network. Using a temporal framework would allow for learning to
be modeled and would take into account that a student’s most recent responses should be
weighted more heavily in assessing their current state of knowledge.
Acknowledgments
This research was made possible by the U.S. Department of Education, Institute of Education
Science “Effective Mathematics Education Research” program grant #R305K03140, the
Office of Naval Research grant #N00014-03-1-0221, NSF CAREER award to Neil Heffernan,
and the Spencer Foundation. All of the opinions in this article are those of the authors
and not those of any of the funders. This work would not have been possible without the
assistance of the 2004–2005 WPI/CMU ASSISTment team that helped make possible this
dataset. The first author is an NSF GK-12 fellow.
References
1. Anderson, J. R. and Lebiere, C. (1998). The Atomic Component of Thought. Erlbaum, Mahwah, NJ.
2. Anozie, N. O. and Junker, B. W. (2006). Predicting end-of-year accountability assessment scores
from monthly student records in an online tutoring system. Proceedings of the AAAI-06 Workshop
on Educational Data Mining, Boston, MA. AAAI Technical Report WS-06-05, pp. 1–6.
3. Arroyo, I. and Woolf, B. (2005). Inferring learning and attitudes from a Bayesian Network of
log file data. Proceedings of the 12th International Conference on Artificial Intelligence in Education,
Amsterdam, the Netherlands, pp. 33–40.
4. Ayers, E. and Junker, B. W. (2006). Do skills combine additively to predict task difficulty in
eighth grade mathematics? In Beck, J., Aimeur, E., and Barnes, T. (eds.), Educational Data
Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. Technical Report WS-06-
05, pp. 14–20.
5. Barnes, T. (2005). Q-matrix method: Mining student response data for knowledge. Proceedings of
the AAAI-05 Workshop on Educational Data Mining, Pittsburgh, PA, 2005. AAAI Technical Report
#WS-05-02.
6. Carmona, C., Millán, E., de-la Cruz, J.-L.P., Trella, M., and Conejo, R. (2005). Introducing pre-
requisite relations in a multi-layered Bayesian student model. In Ardissono, L., Brna, P., and
Mitrovic, A. (eds.), User Modeling. Lecture Notes in Computer Science, Vol. 3538. Springer,
Berlin, Germany, pp. 347–356.
7. Conati, C., Gertner, A., and VanLehn, K. (2002). Using Bayesian networks to manage uncer-
tainty in student modeling. User Modeling and User-Adapted Interaction, 12(4), 371–417.
8. Corbett, A. T., Anderson, J. R. & O’Brien, A. T. (1995). Student modeling in the ACT program-
ming tutor. In Nichols, P., Chipman, S., and Brennan, R. (eds.), Cognitively diagnostic assessment.
Erlbaum, Hillsdale, NJ, pp. 19–41.
9. Feng, M., Heffernan, N. T., Mani, M., and Heffernan, C. (2006). Using mixed-effects model-
ing to compare different grain-sized skill models. In Beck, J., Aimeur, E., and Barnes, T. (eds.),
Educational Data Mining: Papers from the AAAI Workshop. AAAI Press. Technical Report WS-06–
05. ISBN 978-1-57735-287-7, pp. 57–66.
10. Mathan, S. and Koedinger, K. R. (2003). Recasting the feedback debate: Benefits of tutoring error
detection and correction skills. In Hoppe, U., Verdejo, F., and Kay, J. (eds.), Artificial Intelligence
in Education: Shaping the Future of Learning through Intelligent Technologies, Proceedings of AI-ED
2003. IOS Press, Amsterdam, the Netherlands, pp. 39–46.
11. McCalla, G. I. and Greer, J. E. (1994). Granularity-based reasoning and belief revision in stu-
dent models. In Greer, J. E. and McCalla, G. I. (eds.), Student Modelling: The Key to Individualized
Knowledge-Based Instruction. Springer-Verlag, Berlin, Germany, pp 39–62.
12. Mislevy, R.J. and Gitomer, D. H. (1996). The role of probability-based inference in an intelligent
tutoring system. User-Modeling and User Adapted Interaction, 5, 253–282.
13. Pardos, Z. A., Heffernan, N. T., Anderson, B., and Heffernan, C. L. (2006). Using fine-grained
skill models to fit student performance with Bayesian networks. Workshop in Educational Data
Mining held at the Eighth International Conference on Intelligent Tutoring Systems, Taiwan. http://
www.educationaldatamining.org/ITS2006EDM/EDMITS2006.html
14. Pardos, Z. A., Heffernan, N. T., Anderson, B., and Heffernan, C. (2007). The effect of model
granularity on student performance prediction using Bayesian networks. Proceedings of the 11th
International Conference on User Modeling. Springer, Corfu, Greece, pp. 435–439. http://www.
educationaldatamining.org/UM2007.html
15. Pardos, Z. A., Heffernan, N. T., Ruiz, C., and Beck, J. (2008). The composition effect: Conjunctive
or compensatory? An analysis of multi-skill math questions in ITS. Proceedings of the First
Conference on Educational Data Mining, Montreal, Canada, pp. 147–156. http://ihelp.usask.ca/
iaied/ijaied/AIED2007/AIED-EDM_proceeding_full2.pdf
16. Pardos, Z. A., Feng, M., Heffernan, N. T., and Heffernan-Lindquist, C. (2007). Analyzing fine-
grained skill models using Bayesian and mixed effect methods. In Luckin, R. and Koedinger,
K. (eds.), Proceedings of the 13th Conference on Artificial Intelligence in Education. IOS Press,
Amsterdam, the Netherlands, pp. 626–628.
17. Razzaq, L., Heffernan, N., Feng, M., and Pardos, Z. (2007). Developing fine-grained transfer
models in the ASSISTment system. Journal of Technology, Instruction, Cognition, and Learning,
5(3), 289–304.
18. Reye, J. (2004). Student modelling based on belief networks. International Journal of Artificial
Intelligence in Education, 14, 63–96.
19. Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diag-
nosis. In Frederiksen, N., Glaser, R., Lesgold, A., and Shafto, M. G. (eds.), Diagnostic Monitoring
of Skill and Knowledge Acquisition. Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 453–488.
20. Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning pro-
gression. Journal of Research in Science Teaching, 46(6), 716–730.
21. Zapata-Rivera, J.-D. and Greer, J. E. (2004). Interacting with inspectable Bayesian models.
International Journal of Artificial Intelligence in Education, 14, 127–163.
30
Mining for Patterns of Incorrect Response
in Diagnostic Assessment Data
Contents
30.1 Introduction......................................................................................................................... 427
30.2 The DIAGNOSER................................................................................................................ 428
30.3 Method................................................................................................................................. 428
30.4 Results.................................................................................................................................. 432
30.4.1 Pair-Wise Analysis.................................................................................................. 433
30.4.2 Entropy-Based Clustering..................................................................................... 435
30.5 Discussion............................................................................................................................ 437
References...................................................................................................................................... 438
30.1 Introduction
It is a popular belief in education that students come into a classroom with ideas about
the material they are taught that can alter or interfere with their understanding of a topic.
A dramatic example is the video A Private Universe, where Harvard students in cap and
gown are asked to describe the cause of the seasons. Invariably, they fall back on the
misconception that seasons are caused by the distance of the earth from the sun [1]. It
is recommended that teachers should check the extent to which students hold errone-
ous concepts throughout instruction, ideally to deliver personalized feedback to students.
One such approach, developed by Minstrell, is called “diagnostic instruction” [2,3]. It is
based on the idea of delivering multiple-choice questions to students where each ques-
tion (and corresponding responses) attempts to “diagnose” a particular type of thinking.
If the thinking is incorrect, the teacher then uses the diagnosis to select an appropriate
intervention. In Minstrell’s framework, these types of fine-grained incorrect thinking are
called “facets” and are catalogued by topic. Student facets should theoretically result in
predictable behaviors; facets are defined as only “slight generalizations from what stu-
dents actually say or do in the classroom” [2]. A similar model-based feedback approach is
often implemented in intelligent tutoring systems, which recognizes incorrect responses
and provides targeted feedback (e.g., Andes [4], ASSISTment [5]).
However, feedback based on incorrect multiple-choice responses rests on an untested
assumption—that students hold consistent incorrect beliefs, resulting in pre-
dictable responses, which must be individually addressed (as opposed to correcting the
mistake with a very limited model of student knowledge and reteaching the material).
Students may frequently choose the same incorrect answer to a single question, but it does
not necessarily mean that they are doing so because of a firm misunderstanding and will
therefore answer other questions with the same consistent error. For readability, we will
call these consistent errors “misconceptions” throughout this chapter, noting that a pat-
tern of consistent errors does not actually need to correspond to a logical misconception
in the educational sense—it can be evidence of a consistent pattern of fragmented under-
standing or a basic heuristic. We ask the following question: do students hold consistent
misconceptions in the timeframe of a single assessment? We examine data collected over
a period of 2.5 years from the DIAGNOSER [6], a web-based software system that delivers
diagnostic questions to students, to determine the answer to this question.
30.3 Method
We wish to determine whether students are consistent when reasoning using miscon-
ceptions. We examine this issue in two parts. First, we consider the consistency of rea-
soning across pairs of questions. Operationally, we ask: if we know a student’s incorrect
During lunch your science teacher has 30 min to run errands down the
corridor in preparation for the afternoon classes. Your teacher's motion is shown
in the position versus time graph below.
[Position versus time graph: position (m) on the vertical axis, 0–30; time (min) on the horizontal axis, 0–30.]
What is the total distance your teacher traveled during lunch?
a. 5 m
b. 20 m
c. 25 m
d. 45 m
e. 60 m
Figure 30.1
Example diagnostic question targeting concepts of position and distance.
Table 30.1
Facet Descriptions Corresponding to Example Question in Figure 30.1
Response   Description
(a)        Student determines the distance traveled by giving the final position.
(b)        Student reports the farthest position traveled minus the end position as the distance traveled.
(c)        Student reports the farthest position as the distance traveled.
(d)        Student correctly determines the distance traveled for motion in one direction from a graphical representation.
(e)        The student determines distance by adding several positions along the trajectory.
multiple-choice response* to one question, does this help us to predict his or her response
to another question? For example, if a student says that a truck exerts more force upon a
fly that collides with its windshield than the fly does on the truck, we intuit that the stu-
dent might be thinking that greater force inflicts greater damage. We might predict that
when we ask the student to compare the force of a hammer upon a nail, the student will
respond that the hammer exerts more force than the nail. We use the concept of condi-
tional entropy to quantify the additional “information” gained from knowing a response
to a question. We mine the data from an atheoretical perspective.
* We use the term “response” throughout this chapter to refer to a specific multiple choice selection, not merely
whether the question was answered correctly or not.
This is pedagogically very important because teachers who use facet theory to guide
instruction implicitly assume that such predictions would hold, and gauge their interven-
tions accordingly. Similarly, an intelligent tutoring system could successfully base auto-
matic feedback on the underlying conceptual model. Students who can reason consistently
can learn from an experiment set up to demonstrate the flaws in their thinking. If, in the
opposite extreme, students answer diagnostic questions randomly, there is no information
to suggest an appropriate intervention beyond reteaching the material.
Second, patterns that span more than two questions may exist. To identify the prevalence
of patterns, we extend the concept of entropy to the problem of defining the consistency
of responding across several questions. Because some responses are conceptually “closer”
to each other than others, simple mining for patterns will yield far more patterns than is
pedagogically important. Our intuition is that if students are responding with some, but
not perfect, consistency, their response patterns should fall into clusters, where the pat-
terns within clusters are more similar to each other than they are to the response patterns
in other clusters. Note that these clusters are mathematically defined clusters of responses,
in contrast to “facet clusters” that are used in the DIAGNOSER system. If this were the case,
teachers could use the misconceptions that characterize each cluster, rather than miscon-
ceptions that characterize individual items, to identify what interventions are appropriate.
A traditional approach to this analysis would use item-response theory to fit a model to
the different incorrect responses (e.g., [9]). However, as written, DIAGNOSER question sets,
intended for use in a classroom setting, do not provide a sufficient number and difficulty
of questions to obtain reliable estimates of facet difficulty. Furthermore, this approach
assumes that the relative difficulty of responses is the same for all students, which is an
assumption that DIAGNOSER explicitly rejects. We have extended a single question set as
a proof of concept that in fact such an approach can be used to model diagnostic responses,
and that at lower levels of mastery, students are inconsistent in their facet responses [10].
The entropy-based analysis described in this chapter is novel but, we believe, appropri-
ate to mining data from similar model-based assessment or tutoring systems.
In the context of an assessment, consider the questions X and Y, which have response
alternatives X1, X2 … Xk(X) and Y1, Y2 … Yk(Y) where k(X) and k(Y) are the number of allow-
able responses to X and Y. Define N(Xi) as the number of students choosing alternative Xi
and P(Xi) as the probability of choosing alternative Xi.
The entropy associated with each question X is

    H(X) = -\sum_{i=1}^{k(X)} P(X_i) \log_2 P(X_i)        (30.1)

Next, let N(X_iY_j) be the number of cases of responses X_i to question X and Y_j to question Y
(i = 1…k(X), j = 1…k(Y)). Then

    P(Y_j \mid X_i) = \frac{N(X_i Y_j)}{\sum_{v=1}^{k(Y)} N(X_i Y_v)}        (30.2)

The entropy of Y conditional on a specific response X_i is

    H(Y \mid X_i) = -\sum_{j=1}^{k(Y)} P(Y_j \mid X_i) \log_2 P(Y_j \mid X_i)        (30.3)

and the conditional entropy of Y given X is

    H(Y \mid X) = \sum_{i=1}^{k(X)} P(X_i)\, H(Y \mid X_i)        (30.4)
We can use the entropy function to quantify the uncertainty within a question, and then
consider the information (or lack of uncertainty) that we can gain from knowing responses
to another question. This is the difference between the entropy of a question and the con-
ditional entropy, given that the response to a second question is known:

    T(XY) = H(Y) - H(Y \mid X)        (30.5)
This is a measure of the extent to which responses to question X can be used to predict
responses to question Y. It is an effect size measure, i.e., it does not depend on the number
of cases.
Let us consider what this means given two questions X and Y on a test. If the response
to Y can be completely predicted by knowledge of the specific responses to X (X1…Xk), the
entropy of Y conditional upon X is zero. In information-theoretic terms, the mutual infor-
mation will be maximized. If the response to Y is independent of the response to X, there
will be no difference between the terms, and the response to X tells us nothing about how
students will respond to Y.
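As a concrete illustration of Equations 30.1 through 30.5, the following minimal Python sketch computes H(Y), H(Y|X), and T(XY) from a list of paired responses; the response data and function names are invented for illustration and are not from the DIAGNOSER dataset.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts (Eq. 30.1)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(pairs):
    """Given (x, y) response pairs, return H(Y), H(Y|X), and T(XY) = H(Y) - H(Y|X)."""
    n = len(pairs)
    h_y = entropy(list(Counter(y for _, y in pairs).values()))
    # Conditional entropy: weight the entropy of Y within each X-response group
    # by the probability of that group (Eqs. 30.2-30.4).
    h_y_given_x = 0.0
    for x_val, n_x in Counter(x for x, _ in pairs).items():
        y_counts = Counter(y for x, y in pairs if x == x_val)
        h_y_given_x += (n_x / n) * entropy(list(y_counts.values()))
    return h_y, h_y_given_x, h_y - h_y_given_x

# Hypothetical responses of six students to two multiple-choice questions X and Y.
pairs = [("a", "c"), ("a", "c"), ("a", "d"), ("b", "d"), ("b", "d"), ("b", "c")]
print(information_gain(pairs))  # H(Y) = 1.0, H(Y|X) ~ 0.92, gain ~ 0.08 bits
```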
We can use the concept of entropy to place students into clusters according to their
responses to a set of multiple-choice questions to minimize the entropy of the entire sys-
tem. For example, suppose there is just one question and half of the students respond A,
and half respond B. By our definitions above, we see that the entropy of this random vari-
able is 1.0. If we can cluster the students into two groups, those who selected A and those
who selected B, we can reduce the conditional entropy of the system, given that group
membership is known, to zero.
More formally, we seek to minimize the expected entropy. We calculate the expected
entropy by multiplying the entropy of each cluster (the sum of the entropies of each attri-
bute) by the probability that an item falls into that cluster. Finding the optimal clustering
is an NP-complete problem. Efficient implementations of entropy-based clustering have
been described and implemented, such as COOLCAT [11] and LIMBO [12], and as we scale
to large datasets, we will need to evaluate such algorithms.
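A minimal sketch of the expected-entropy objective described above is shown below; it is not an implementation of COOLCAT or LIMBO, and the data structures (lists of response-pattern tuples) are assumptions for illustration. A greedy or heuristic search would then try alternative assignments of students to clusters and keep the one with the lowest expected entropy.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def expected_entropy(clusters):
    """Expected entropy of a clustering: for each cluster, sum the entropies of
    its attributes (question responses), weight by the cluster's probability,
    and add up. `clusters` is a list of clusters; each cluster is a list of
    response patterns (tuples of responses, one per question)."""
    n_total = sum(len(c) for c in clusters)
    exp_h = 0.0
    for cluster in clusters:
        n_questions = len(cluster[0])
        cluster_h = sum(entropy([pattern[q] for pattern in cluster])
                        for q in range(n_questions))
        exp_h += (len(cluster) / n_total) * cluster_h
    return exp_h

# The single-question example from the text: half answer A, half answer B.
unclustered = [[("A",), ("A",), ("B",), ("B",)]]
split = [[("A",), ("A",)], [("B",), ("B",)]]
print(expected_entropy(unclustered))  # 1.0 bit
print(expected_entropy(split))        # 0.0 bits
```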
We evaluate the effectiveness of the clustering by calculating the categorical utility (CU)
[13] function, which gives an idea of how predictive the clustering is, penalized by the
number of clusters. The CU function is itself built up from a lower-order statistic, the Δ
index introduced by Goodman and Kruskal [13,14]. The Δ index is simply the decrease
in the number of erroneous predictions of a variable’s value that is obtained by knowing
which cluster the case being predicted is in, using the strategy of probability matching.
This is best seen by an example. Suppose that on a particular two-choice question half the
respondents say “Yes” and half say “No.” Given this information, one expects to make an
error in prediction half the time. Suppose further, though, that on the basis of other vari-
ables, respondents are clustered into two groups, A and B. In cluster A, 80% of the respon-
dents respond “Yes” and 20% respond “No.” If you guess according to the distribution of
responses, given the knowledge of cluster A, you will guess “Yes” 80% of the time and be
right 80% of those guesses, and you will guess “No” 20% of the time and be right 20% of
those guesses. Your correct guess rate will be 0.8 * 0.8 + 0.2 * 0.2 = 0.68. The Δ index for the
cluster is 0.68 − 0.50 = 0.18.
The formula for the Δ index is given as follows, where C denotes the clustering, C_k (k = 1…n) is
one of its n clusters, A_i is question i, and each A_i can assume values V_ij:

    \Delta(C, A_i) = \sum_{k=1}^{n} P(C_k) \sum_{j} P(A_i = V_{ij} \mid C_k)^2 - \sum_{j} P(A_i = V_{ij})^2        (30.6)

The CU function is the average of the Δ values for each of the n clusters. This introduces a
penalty factor that increases with the number of clusters:

    CU(C) = \frac{1}{n} \sum_{i} \Delta(C, A_i)        (30.7)
Therefore, the larger the CU function, the better the clustering, in the sense that knowl-
edge of the cluster helps to predict the responses to the questions.
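The following sketch implements Equations 30.6 and 30.7 directly and reproduces the worked example above (a 50/50 overall split, with two equal-sized clusters that each answer 80/20), giving a Δ of 0.18; the data are invented for illustration.

```python
from collections import Counter

def delta_index(clusters, q):
    """Goodman-Kruskal delta index (Eq. 30.6) for question index q:
    sum_k P(C_k) * sum_j P(A=V_j | C_k)^2  minus  sum_j P(A=V_j)^2."""
    all_answers = [pattern[q] for cluster in clusters for pattern in cluster]
    n_total = len(all_answers)
    baseline = sum((c / n_total) ** 2 for c in Counter(all_answers).values())
    weighted = 0.0
    for cluster in clusters:
        answers = [pattern[q] for pattern in cluster]
        inside = sum((c / len(answers)) ** 2 for c in Counter(answers).values())
        weighted += (len(cluster) / n_total) * inside
    return weighted - baseline

def category_utility(clusters):
    """Category utility (Eq. 30.7): sum of the delta indices over the questions,
    divided by the number of clusters n as a penalty."""
    n_questions = len(clusters[0][0])
    return sum(delta_index(clusters, q) for q in range(n_questions)) / len(clusters)

# Worked example from the text: cluster A answers 80% "Yes", cluster B 80% "No";
# overall the split is 50/50, so the delta index is 0.68 - 0.50 = 0.18.
cluster_a = [("Yes",)] * 8 + [("No",)] * 2
cluster_b = [("No",)] * 8 + [("Yes",)] * 2
print(delta_index([cluster_a, cluster_b], 0))    # 0.18
print(category_utility([cluster_a, cluster_b]))  # 0.09
```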
30.4 Results
We examined responses to questions students took in DIAGNOSER over a period of
approximately two years. These sets included questions in the subjects of motion of objects,
nature of forces, forces to explain motion, and human body systems. We considered only
those question sets that have been taken at least 200 times, and removed question set
responses from experimenters or teachers. Students occasionally retake question sets after
study; we included these responses in our analysis. Explaining 2D motion was taken the
least (N = 200) and position and distance 1 was taken by the most students (N = 5540). In all,
there were over forty thousand completed question sets (N = 40,237).
In the DIAGNOSER, some questions are offered conditionally; only students who obtain
a certain response on one question are presented with the next. For pair-wise analysis, we
considered only the subset of students who complete each pair of questions. In addition,
some questions allow a diagnosis of “unknown,” meaning that the facet represented by
that response is unknown (e.g., when students type in a numerical response that does not
correspond to a postulated reasoning strategy). Because unknown facets have no corre-
sponding instructional interventions, we limited our pair-wise analysis to question pairs
where both responses have been diagnosed with known facets through multiple-choice
responses. For cluster analysis, missing questions and unknown responses are part of the
response pattern.
The final subset of data represents student responses in well-developed and tested
modules where consistent relationships between pairs of diagnosed responses are most
likely to be found. We examine relationships among questions completed by each student
within each question set, which is intended to take about 20 min to complete during a
class. Therefore, there is little chance that instruction will affect the consistency of incor-
rect response at this timescale.
[Bar chart, one bar per question set; y-axis: Entropy (bit), 0–2.5; each bar stacks the conditional entropy and the maximum information gain.]
Figure 30.2
The conditional entropy (in bits) and corresponding information gain for the question pair within each question
set that had the maximum information gain.
Figure 30.2 shows, for each question set, the conditional entropy (H(Y|X)) and the
information gain (T(XY)) for the question pair within the set that had the maximum
information gain. By Equation 30.8, the sum of the conditional entropy and information
gain (the height of each bar) is the entropy for question Y. The information gain is quite
small, in bits, meaning that for the most highly predictable pair of questions in each set,
knowing the student’s response to the first question did not significantly help predict the
response to the second question. In the extreme, if the response to the first question of
the pair completely predicts the response to the second question, the conditional entropy
would be zero.
Ultimately, however, we are less interested in the maximal information gain than in the
conditional entropy. Low conditional entropy indicates that the responses to a question are
highly determined by the responses to another question.
Figure 30.3 shows the conditional entropy (in bits) and the corresponding information
gain for the question pair (X,Y) in each question set with the smallest conditional entropy.
The conditional entropy H (Y|X) added to the information gain T (XY) is the entropy for Y.
With a few exceptions, the smallest conditional entropy is not very low. Where it is low, the
entropy of the question is low in the first place (e.g., where students largely choose a single
response because the question is very easy or the distracters are not attractive).
One question pair, in the set Explaining2DMotion1, stands out in both Figures 30.2 and
30.3 with a low conditional entropy and high information gain. This indicates that the
first question in the pair is an excellent predictor of responses to the second question of
the pair. This occurred because the second question of the pair asks students to explain
their reasoning for their response, providing good cues for consistent responses. This is
[Bar chart, one bar per question set; y-axis: Entropy (bit), 0–1.6; each bar stacks the minimum conditional entropy and the corresponding information gain.]
Figure 30.3
The conditional entropy (in bits) and the corresponding information gain for the question pair in each question
set with the smallest conditional entropy.
an example of a question pair where the content of the questions is closely related. There
is also an alternative, "I do not see my reasoning in this list," that was chosen very frequently
but was eliminated from the analysis because it did not correspond to a known diagnosis.
We see high information gain in this case only because we consider only a small subset of
responses.
In summary, we saw little evidence from entropy-based pair-wise analysis that incorrect
responses successfully predict other responses beyond that which would be expected by
random chance. There are several possible explanations for this finding. First, although the
questions are judged by content experts to be related in a facet cluster, it may be that stu-
dents do not reason consistently until they gain a certain level of understanding. Evidence
for this has been provided by research on a similar dataset [10]. Second, as we will explore
in the next section, consistent reasoning strategies may span multiple questions.
[Four panels plotting the category utility function against the number of clusters (2–8), one curve per question set: (a) description of motion, (b) nature of forces, (c) forces to explain motion, (d) human body systems (digestive system).]
Figure 30.4
Category utility function curves for DIAGNOSER question sets: (a) Description of motion, (b) nature of forces,
(c) forces to explain motion, (d) human body systems (digestive system).
response space so constrained, the five-cluster solution that was optimal had one cluster
where responses are mostly correct, three clusters each representing the most common
error to each of the questions with three or fewer responses, and one “entropy” cluster
with a high variation of responses.
Effects of Pushes and Pulls 2 has only six multiple-choice questions, four of which have
only three multiple-choice responses. Of the four clusters in the optimal solution, one clus-
ter was mostly correct, two clusters included the most common error to two of the ques-
tions with three multiple-choice responses, and one was an “entropy” cluster with a high
variation of responses.
The results of our cluster analysis do not support the existence of unique patterns of
thought. Rather, clusters form by ability levels.
[Bar chart, one bar per question set: percentage reduction in overall entropy (0%–50%) for the best clustering.]
Figure 30.5
Percentage reduction in overall entropy for best clustering.
30.5 Discussion
We have found little evidence from either pair-wise analysis of questions or clustering
analysis that students, on the whole, have systematic misconceptions that can be diag-
nosed by the patterns of facets within the time frame of a single assessment. We are not
the first to notice this. This result is consistent with work by Tatsuoka [15] describing the
inconsistency with which students use procedural rules to solve arithmetic problems until
they reach mastery (i.e., use the correct rules). Halloun and Hestenes [16] noted that the
common sense conceptual systems of students have much less internal coherence than
naïve theories, which, although incorrect, are internally consistent (e.g., Aristotelian rea-
soning). Facets are in fact designed to capture fine-grained and inconsistent fragments of
knowledge and reasoning [2], similar to ideas proposed by diSessa [17], although the pro-
posed interventions assume facets are actually evidence of consistent reasoning. Thus, this
analysis suggests that student reasoning is fragmented and inconsistent, within a broad
representative sample of students in grades 6–12 taking physics in middle or high school.
These results have serious implications for pedagogical intervention. Predictable
responses to diagnostic questions allow teachers to use instructional strategies that ask
students to commit to ideas and then engage in experiments to test their hypotheses.
Tutoring systems can provide feedback based on an underlying model of incorrect stu-
dent understanding. In the extreme, if a student answers every question completely ran-
domly, there is no pattern to the responses, and no information to be exploited. Absence
References
1. M.H. Schneps, P.M. Sadler, S. Woll, and L. Crouse, A Private Universe, South Burlington, VT:
Annenberg Media, 1989.
2. J. Minstrell, Facets of students’ thinking: Designing to cross the gap from research to standards-
based practice, in Designing for Science: Implications for Professional, Instructional, and Everyday
Science, K. Crowley, C.D. Schunn, and T. Okada (eds.), Mawah, NJ: Erlbaum, 2001.
3. E. Hunt and J. Minstrell, Effective instruction in science and mathematics: Psychological prin-
ciples and social constraints, Issues in Education, 2, 1996, 123–162.
4. K. VanLehn, C. Lynch, K. Schulze, J.A. Shapiro, R. Shelby, L. Taylor, D. Treacy, A. Weinstein,
and M. Wintersgill, The Andes physics tutoring system: Lessons learned, International Journal of
Artificial Intelligence in Education, 15, 2005, 147–204.
5. L. Razzaq, M. Feng, G. Heffernan, K.R. Koedinger, B. Junker, S. Ritter, A. Knight, C. Aniszczyk,
S. Choksey, T. Livak, E. Mercado, T. Turner, R. Upalekar, J.A. Walonoski, M.A. Macasek, and
K.P. Rasumssen, The assistment project: Blending assessment and assisting, in Proceedings of
the 12th International Conference on Artificial Intelligence in Education, C.K. Looi, G. McCalla, B.
Bredeweg, and J. Breuker (eds.), Amsterdam, the Netherlands: ISO Press, 2005, pp. 555–562.
6. FACET Innovations, Welcome to Diagnoser: Instructional Tools for Science and Mathematics. www.
diagnose.com.
7. J. Minstrell, R. Anderson, P. Kraus, and J.E. Minstrell, Bridging from practice to research and
back: Tools to support formative assessment, in Science Assessment: Research and Practical
Approaches, J. Coffey, R. Douglas, and C. Sterns (eds.), NSTA Press, Arlington, VA, 2008.
8. E. Hunt and J. Minstrell, The DIAGNOSER project: Formative assessment in the service of learn-
ing. Paper presented at the International Association for Educational Assessment, Philadelphia,
PA, 2004.
9. C. Huang, Psychometric analysis based on evidence-centered design and cognitive science of
learning to explore student’s problem-solving in physics, unpublished dissertation, December
2003.
10. K. Scalise, T. Madhyastha, J. Minstrell, and M. Wilson, Improving assessment evidence in
e-learning products: Some solutions for reliability, International Journal of Learning Technology, in
press.
11. D. Barbará, Y. Li, and J. Couto, COOLCAT: An entropy-based algorithm for categorical cluster-
ing, in Proceedings of the 11th International Conference on Information and Knowledge Management,
McLean, VA: ACM, 2002, pp. 582–589.
12. P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik, LIMBO: Scalable clustering of categorical
data, in Advances in Database Technology (EDBT 2004), Crete, Greece, 2004, pp. 531–532.
13. B. Mirkin, Reinterpreting the category utility function, Machine Learning, 45(2), November 2001,
219–228.
14. L.A. Goodman and W. Kruskal, Measures of association for cross classifications, Journal of the
American Statistical Association, 49, 1954, 732–764.
15. K.K. Tatsuoka, M. Birenbaum, and J. Arnold, On the stability of students’ rules of operation for
solving arithmetic problems, Journal of Educational Measurement, 26, 1989, 351–361.
16. J.A. Halloun and D. Hestenes, Common sense concepts about motion, American Journal of
Physics, 53, 1985, 1056–1065.
17. A.A. diSessa, Knowledge in Pieces, University of California, Berkeley, CA, 1985.
18. D. Sleeman, A.E. Kelly, R. Martinak, R.D. Ward, and J.L. Moore, Studies of diagnosis and reme-
diation with high school algebra students, Cognitive Science, 13, 1989, 551–568.
31
Machine-Learning Assessment of Students’ Behavior
within Interactive Learning Environments
Manolis Mavrikis
Contents
31.1 Introduction......................................................................................................................... 441
31.2 Background..........................................................................................................................442
31.2.1 The Interactive Learning Environment WaLLiS................................................442
31.2.2 Datasets....................................................................................................................442
31.2.3 Employing Bayesian Networks for Student Modeling......................................443
31.3 Assessing Students’ Behavior...........................................................................................444
31.3.1 Predicting the Necessity of Help Requests.........................................................444
31.3.2 Predicting the Benefit of Students’ Interactions.................................................446
31.4 Application and Future Work........................................................................................... 447
References...................................................................................................................................... 449
31.1 Introduction
Enhancing an interactive learning environment (ILE) with personalized feedback within
and between activities necessitates several intelligent capabilities on behalf of its adap-
tive components. Such capabilities require means of assessing students’ behavior within
the ILE.
This chapter presents a case study of the use of educational data-mining (EDM) tech-
niques to assess the quality of students’ interaction in terms of learning, and subsequently
to predict whether they can answer a given question without asking for help. Such an
approach, based on data, provides an objective and operational, rather than intuitive or ad
hoc, measure and can empower intelligent components of an ILE to adapt and personalize
the provided feedback based on students’ interactions. The next section, after describing
briefly the ILE (WaLLiS) used as a test bed for the research presented here, further moti-
vates the necessity for such capabilities on behalf of the system. The section also presents
the datasets and provides background and justification for the machine-learning method,
Bayesian network (BN), which was primarily employed.
Section 31.3 presents details of the development of two interrelated BNs, their accuracy
and comparisons with other techniques and, in particular, with decision trees. The last
section discusses the use of the developed models in WaLLiS and raises issues around the
application of such models relevant to EDM researchers in general. In addition, it presents
future work that provides insights to improve and automate the development of similar
models in related work.
31.2 Background
31.2.1 The Interactive Learning Environment WaLLiS
WaLLiS is a web-based environment that hosts content that includes theory or example
pages, as well as interactive and exploratory activities [1]. Apart from the typical com-
ponents of the system that deliver the material and the tree-based navigation of the
content, WaLLiS provides adaptive feedback that takes into account students’ errors and
common misconceptions. The feedback is designed to help them progress and complete
the activities. This way, a problem that students cannot solve is turned into a teach-
ing aid from which they learn by practicing. In that sense, WaLLiS is similar to other
Intelligent Tutoring Systems (ITS) (e.g., [2]), which have the potential to improve student
learning.
However, research on students’ interactions within ILE shows that their behavior is
neither always optimal nor beneficial to their learning [3,4]. In particular, the analysis of
students’ working within WaLLiS indicated that students’ help seeking behavior played a
particular role in learning [4]. The importance of help seeking is further supported by the
studies of human–human and classroom interactions (e.g., [5]), as well as other research in
ITS [6,7]. Moreover, a combination of qualitative research and statistical analysis indicates
that part of the evidence that human tutors employ to adapt their pedagogical strategies
comes particularly from help requests for questions on which the tutors estimate that a stu-
dent’s request for help is superfluous [4,8]. This estimation is mostly based on the quality
of previous interactions. Therefore, in principle, an operational way to assess the quality
of students’ interaction in terms of learning as well as a way to predict whether students
can answer a given question without asking for help, seems an important requirement for
an intelligent component of an ILE. The main aim of the case study presented here was to
develop such a component and to investigate its use.
31.2.2 Datasets
WaLLiS is integrated in the teaching of various courses of the School of Mathematics of
the University of Edinburgh. The datasets that facilitated the research discussed here are
from the second year module “Geometry Iteration and Convergence” (GIC) and, in par-
ticular, its last part that introduces “Conic Sections.” Although students are familiar with
the system from other courses, the materials taught in this part are unknown to them
and constitute a rather individual unit, which is taught solely through WaLLiS. This way,
for the data analysis reported here, it was possible to establish not only the indicators of
prerequisite knowledge but also that any performance results are reasonably (if not solely)
attributed to the interaction with the system and not other external factors. In particular,
with the collaboration of the lecturer, certain questions on the students’ final exam were
designed to test long-term retention.
Accordingly, it was possible to conduct machine-learning analysis based on datasets
collected through the application of WaLLiS over 3 years (2003–2005). These are referred to
as GIC03, GIC04, and GIC05. The students who interacted with the system were 126, 133,
and 115, respectively. The GIC04 and GIC05 data collection aimed at collecting, apart from
students’ interactions, their performance. This was assessed by averaging (a) the students’
marks on an impromptu assessment they had to complete right after their interaction with
the system and (b) their mark on a final exam purposefully designed to probe learning
that can be attributed to the system.
It is evident that data collection under realistic conditions entails several challenges that
result in discarding some data. Due to the way the datasets were collected (i.e., remotely
over the internet and not during a controlled laboratory study), they can be quite noisy.
The methods used for data collection are subject to bandwidth availability, appropriate
security settings and other client and server-side concerns (see [9]). In addition, some stu-
dents did not consent to their data being recorded. For other students, it could not be
established whether they were familiar with the system, so their data were discarded since
their behavior was quite different. After this data cleaning process, the GIC03, GIC04, and
GIC05 datasets comprised 106, 126, and 99 students, respectively.
Table 31.1
Classification Accuracy, Kappa Statistic and Recall Values for BN and Decision Trees for Predicting Superfluous Help Request in Any Activity

                  BayesNet               J4.8
                  Cross      Test Set    Cross      Test Set
Accuracy          67.64      66.52       65.84      64.05
Kappa              0.317      0.318       0.30       0.23
Recall   True      0.60       0.56        0.59       0.5
         False     0.74       0.76        0.71       0.72
Table 31.2
Average Classification Accuracy, Kappa Statistic and Recall Values for BNs and Decision Trees for Superfluous Help Requests per Activity

                  BayesNet               J4.8
                  Cross      Test Set    Cross      Test Set
Accuracy          69.12      67.61       67.84      63.05
Kappa              0.36       0.35        0.34       0.27
Recall   True      0.74       0.71        0.72       0.66
         False     0.62       0.62        0.60       0.58
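For readers who want to reproduce these metrics, the sketch below computes accuracy, Cohen's kappa, and per-class recall from a 2x2 confusion matrix; the counts are invented and merely chosen to land near the values reported in the tables, not taken from the WaLLiS data.

```python
def classification_report(tp, fn, fp, tn):
    """Accuracy, Cohen's kappa, and per-class recall from a 2x2 confusion
    matrix, counting TRUE as the positive class."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Expected agreement by chance, from the actual and predicted marginals.
    p_true = ((tp + fn) / n) * ((tp + fp) / n)
    p_false = ((fp + tn) / n) * ((fn + tn) / n)
    expected = p_true + p_false
    kappa = (accuracy - expected) / (1 - expected)
    recall_true = tp / (tp + fn)
    recall_false = tn / (tn + fp)
    return accuracy, kappa, recall_true, recall_false

# Invented counts; yields accuracy 0.65, kappa 0.30, recall 0.60 (TRUE) / 0.70 (FALSE).
print(classification_report(tp=60, fn=40, fp=30, tn=70))
```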
Table 31.3
Features Considered for Learning the Model of Beneficial Interaction
1. Help frequency
2. Error frequency
3. Tendency to ask for help rather than risk an error, i.e., help/(errors + help), as in [26]
4. No need for help but help requested (according to the previous section) (true/false)
5. Answertype—the type of the answer required (mcq, blank, matrix, checkbox)
6. Previous attempts in items related to the current skill
• If this was the student’s first opportunity to practice this skill, −1.
• If no previous attempt was successful, 0.
• Otherwise, a measure of the degree of completeness of the goals of related items (if there were no
related items then the standardized score of their exam in prerequisite of this skill)
7. Time in standard deviations off the mean time taken by all students on the same item.
8. Speed between hints—The Mahalanobis distance of the vector of times between hints from the vector
of mean times taken by all students on the same hints and item
9. Accessing the related example while answering (true/false)
10. Self-correction (true/false)
11. Requested solution without attempt to answer (true/false)
12. Reflection on hints (defined as the time until next action from hint delivery) (calculated as in point 8,
using the Mahalanobis distance)
13. The number of theoretical material lookups that the student followed when such a lookup was
suggested by the system (−1 if no lookups were suggested)
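A hedged sketch of how features such as 3, 7, and 8 from Table 31.3 might be computed from logged interactions is given below; the data structures, names, and covariance handling are assumptions for illustration, not the chapter's actual feature-extraction code.

```python
import numpy as np

def help_tendency(n_help, n_errors):
    """Feature 3: tendency to ask for help rather than risk an error."""
    return n_help / (n_errors + n_help) if (n_errors + n_help) else 0.0

def time_zscore(student_time, mean_time, std_time):
    """Feature 7: time on the item, in standard deviations off the mean time
    taken by all students on the same item."""
    return (student_time - mean_time) / std_time

def hint_speed(times, mean_times, cov):
    """Feature 8: Mahalanobis distance between a student's vector of times
    between hints and the mean vector over all students on the same item;
    `cov` is the covariance matrix of those times across students."""
    diff = np.asarray(times, dtype=float) - np.asarray(mean_times, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Invented numbers for one student on one item.
print(help_tendency(n_help=3, n_errors=2))       # 0.6
print(time_zscore(540.0, 420.0, 80.0))           # 1.5
print(hint_speed([4.0, 9.0], [5.0, 7.0], np.array([[2.0, 0.3], [0.3, 4.0]])))
```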
who did, provided wrong answers). The final set of data contains 472 instances from GIC04
and 352 from GIC05.
For similar reasons as the ones presented in the previous section, BNs were preferred.
To facilitate the algorithm's search, feature selection is employed in advance to remove
irrelevant and redundant features. Although (as expected) similar structures and accuracies
are learned whether the full set of data is employed or not, simpler models are always
preferred. In fact, the simplified model achieved slightly better accuracy on a 10-fold
cross-validation check and slightly better accuracy on the test set. By removing redundant
features, the remaining ones were easier to comprehend. This allowed a more informed
ordering of the variables, which can affect the search for the structure of the ICS algorithm.

Table 31.4
Variables After Feature Selection from Table 31.3
1. Tendency for help
2. Need for help
3. Self-correction
4. Example access
5. Average reflection time
6. Speed for hints
7. Error frequency
The list of reduced variables is shown in Table 31.4 and the final model in Figure 31.2.
Its accuracy, as well as comparisons with decision trees, is presented in Table 31.5.
[Bayesian network with the selected features (e.g., errors, selfCorrection) as parents of the effectiveness node.]
Figure 31.2
BN for predicting beneficial interaction.
Table 31.5
Classification Accuracy and Kappa Statistic for BN and Tree Induction to Predict Beneficial Interaction

                  BayesNet               J4.8
                  Cross      Test Set    Cross      Test Set
Accuracy          70.11      68.23       66.52      65.843
Kappa              0.40       0.36        0.318      0.313
Recall   True      0.72       0.73        0.714      0.686
         False     0.67       0.62        0.605      0.626
with the system. Therefore, speed of learning was not relevant. However, speed becomes
important when automating the learning of the model to occur at run-time while students
are interacting with the ILE. Feature selection, therefore, becomes relevant since, by reduc-
ing the available features, it speeds up the learning process significantly.
The Bayesian models were integrated in the system.* Their output is utilized by a diag-
nostic agent, the outcomes of which are taken into account by the feedback mechanism in
order to adapt its actions (for more details the reader is referred to [4]). As already men-
tioned, the accuracy of the models was considered adequate for implementation. However,
as it should be the case in all EDM approaches, such decisions should not be taken solely
on the grounds of accuracy but should be based on a careful balance of the likely educa-
tional consequences of any incorrect predictions.
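As an illustration of how such a model can be queried at run time, the sketch below builds a hand-rolled toy network shaped like Figure 31.2 and asks for the probability of a beneficial interaction; it is only a stand-in for the JavaBayes/WEKA pipeline described in the footnote, and all probabilities, state names, and functions are invented.

```python
# Toy network shaped like Figure 31.2: errors and selfCorrection are parents of
# effectiveness. All probabilities below are invented for illustration.
P_ERRORS = {"low": 0.7, "high": 0.3}
P_SELFCORR = {"no": 0.6, "yes": 0.4}
# P(effectiveness = beneficial | errors, selfCorrection)
P_BENEFICIAL = {
    ("low", "no"): 0.6, ("low", "yes"): 0.8,
    ("high", "no"): 0.2, ("high", "yes"): 0.5,
}

def p_beneficial(errors=None, self_correction=None):
    """P(effectiveness = beneficial | observed evidence), obtained by
    enumerating the unobserved parents of the toy network."""
    num = den = 0.0
    for e, p_e in P_ERRORS.items():
        if errors is not None and e != errors:
            continue
        for s, p_s in P_SELFCORR.items():
            if self_correction is not None and s != self_correction:
                continue
            w = p_e * p_s
            num += w * P_BENEFICIAL[(e, s)]
            den += w
    return num / den

# A student with a high error frequency, self-correction behavior unobserved.
print(p_beneficial(errors="high"))  # 0.32
```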
In the case of WaLLiS and the model of unnecessary help requests, particular importance
was given to the recall of instances classified as FALSE (i.e., where the model would predict
that a student does not need help). The high values indicate fewer false negatives (i.e., fewer
cases where a student needs help but the model predicts that they do not). For the model of
beneficial interaction, we were interested in high recall of instances classified as TRUE, so as
to have as few cases as possible where the model would predict that the student's
interaction is beneficial when it is not. Nevertheless, it was decided to take
* JavaBayes (http://www.cs.cmu.edu/javabayes/) is used in WaLLiS to query the models that result from the
offline processing from WEKA.
an approach that is as unobtrusive and as minimally preventive as possible. Given that in some
cases a model is wrong, it is obvious, but paramount, to use these predictions
in a way that has the fewest negative educational consequences.
dents solicit suggestions on what to study next, the prediction of beneficial interaction is
employed assisting the system to prioritize the available items. In other cases, the predic-
tion is employed for providing suggestions about which items to study next or for interact-
ing again with the same item. As for the prediction of unnecessary help requests, this is
not employed directly (e.g., to stop students from asking for help) but rather through the
model of beneficial interaction and in other models of affect prediction [22].
The aforementioned decisions may seem specific to the case study presented here
but they are relevant to designers of systems that take into account outcomes of EDM
research. Similar issues are raised in [3,6,13]. The challenge is to strike a balance between
an approach that utilizes the predictions from the models in an informative way and a
more preventative approach that may be required in some cases. When highly uncertain
machine-learned decisions are taken, it may be better, in some circumstances, not to com-
municate this information with the student directly. Models such as the ones developed
here could have other applications. For example, the prediction of the benefit of the inter-
action could be used in open-learner modeling or in classroom environments to provide
useful information for teachers on the basis of which they can adapt their own teaching.
Future work will focus on improving the accuracy of models and automating the process.
In addition, by providing access to designers and other stakeholders of the ILE (e.g., lectur-
ers), they can also query the model and draw some conclusions about the nature of the
students’ interaction with the system. It is clear that approaches such as the ones presented
here have the potential to contribute to a deeper understanding of complex behaviors. Apart
from improving the adaptivity of the intelligent components of the system, machine-learned
models of students’ behavior could also help in the system’s redesign by targeting, for exam-
ple, the types of actions that fail to lead to a beneficial interaction.
References
1. Mavrikis, M. and Maciocia, A., WALLIS: A web-based ILE for science and engineering students
studying mathematics, in Workshop of Advanced Technologies for Mathematics Education in 11th
International Conference on Artificial Intelligence in Education, Sydney Australia, 2003.
2. Koedinger, K., Anderson, J., Hadley, W., and Mark, M., Intelligent tutoring goes to school in the
big city, International Journal of Artificial Intelligence in Education 8, 30–43, 1997.
3. Baker, R. S., Corbett, A. T., Koedinger, K. R., and Wagner, A. Z., Off-task behavior in the cog-
nitive tutor classroom: When students “Game The system,” in Proceedings of ACM CHI 2004:
Computer-Human Interaction, Vienna, Austria, pp. 383–390, 2004.
4. Mavrikis, M., Modelling students’ behaviour and affective states in ILEs through educational
data mining, PhD thesis, The University of Edinburgh, Edinburgh, U.K., 2008.
5. Karabenick, S. A., Strategic Help Seeking: Implications for Learning and Teaching, Lawrence Erlbaum
Associates, Mahwah, NJ, 1998.
6. Aleven, V. and Koedinger, K. R., Limitations of student control: Do students know when they
need help? in Proceedings of Fifth International Conference on Intelligent Tutoring Systems, Montreal,
Canada, pp. 292–303, 2000.
7. Aleven, V., McLaren, B., Roll, I., and Koedinger, K. R., Toward tutoring help seeking: Applying
cognitive modeling to meta-cognitive skills, in Intelligent Tutoring Systems, Maceió, Brazil,
pp. 227–239, 2004.
8. Porayska-Pomsta, K., Mavrikis, M., and Pain, H., Diagnosing and acting on student affect: The
tutor’s perspective, User Modeling and User-Adapted Interaction 18 (1), 125–173, 2008.
9. Mavrikis, M., Logging, replaying and analysing students’ interactions in a web-based ILE to
improve student modelling, in Artificial Intelligence in Education: Supporting Learning through
Intelligent and Socially Informed Technology (Proceedings of the 12th International Conference
on Artificial Intelligence in Education, AIED2005), Looi, C., McCalla, G., Bredeweg, B., and
Breuker, J. (eds.), IOS Press, Amsterdam, the Netherlands, Vol. 125, p. 967, 2005.
10. Morales, R., Van Labeke, N., and Brna, P., Approximate modelling of the multi-dimensional
learner, in Proceedings of the Eighth International Conference on Intelligent Tutoring Systems, Jhongli,
Taiwan, pp. 555–564, 2006.
11. Arroyo, I., Murray, T., Woolf, B., and Beal, C., Inferring unobservable learning variables from
students’ help seeking behavior, in Proceedings of the Seventh International Conference on Intelligent
Tutoring Systems, Maceio, Brazil, pp. 782–784, 2004.
12. Corbett, A. and Anderson, J., Student modeling and mastery learning in a computer-based
programming tutor, in Proceedings of the Second International Conference on Intelligent Tutoring
Systems, Montreal, Canada, pp. 413–420, 1992.
13. Mayo, M. and Mitrovic, A., Using a probabilistic student model to control problem difficulty,
in Proceedings of Fifth International Conference on Intelligent Tutoring Systems, Montreal, Canada,
pp. 524–533, 2000.
14. Conati, C., Gertner, A., Vanlehn, K., and Druzdzel, M., On-line student modeling for coached
problem solving using Bayesian networks, in User Modeling: Proceedings of the Sixth International
Conference, Jameson, A., Paris, C., and Tasso, C. (eds.), Sardinia, Italy, 1997.
15. Collins, J., Greer, J., and Huang, S., Adaptive assessment using granularity hierarchies and
bayesian nets, in Proceedings of the Third International Conference on Intelligent Tutoring Systems,
Montreal, Canada, pp. 569–577, 1996.
16. Jonsson, A., John, J., Mehranian, H., Arroyo, I., Woolf, B., Barto, A., Fisher, D., and Mahadevan, S.,
Evaluating the feasibility of learning student models from data, in AAAI Workshop on Educational
Data Mining, Pittsburgh, PA, 2005.
17. Ferguson, K., Arroyo, I., Mahadevan, S., Woolf, B., and Barto, A., Improving intelligent tutor-
ing systems: Using expectation maximization to learn student skill levels, in Proceedings of the
Eighth International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan, pp. 453–462, 2006.
18. Anderson, J. R., Corbett, A. T., Koedinger, K. R., and Pelletier, R., Cognitive tutors: Lessons
learned, The Journal of the Learning Sciences 4 (2), 167–207, 1995.
19. Vanlehn, K. and Martin, J., Evaluation of an assessment system based on Bayesian student
modeling, International Journal of Artificial Intelligence in Education 8 (2), 179–221, 1998.
20. Bouckaert, R. R., Bayesian networks in Weka, Technical Report 14/2004, Computer Science
Department, University of Waikato, Hamilton, New Zealand, 2004.
21. Verma, T. and Pearl, J., An algorithm for deciding if a set of observed independencies has a
causal explanation, in Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence,
Stanford, CA, pp. 323–330, 1992.
22. Mavrikis, M., Maciocia, A., and Lee, J., Towards predictive modelling of student affect from
web-based interactions, in Artificial Intelligence in Education: Building Technology Rich Learning
Contexts that Work (Proceedings of the 13th International Conference on Artificial Intelligence in
Education, AIED2007), Luckin, R., Koedinger, K., and Greer, J. (eds.), Los Angeles, CA, IOS
Press, Amsterdam, the Netherlands, Vol. 158, pp. 169–176, 2007.
23. Bunt, A. and Conati, C., Probabilistic student modelling to improve exploratory behaviour,
User Modeling and User-Adapted Interaction 13, 269–309, 2003.
24. Mahalanobis, P. C., On the generalized distance in statistics, Proceedings of the National Institute of
Sciences of India 2, 49–55, 1936.
25. De Maesschalck, R., Rimbaud, J., and Massart, D. L., The Mahalanobis distance, Chemometrics
and Intelligent Laboratory Systems 50, 1–18, 2000.
26. Wood, H., Help seeking, learning and contingent tutoring, Computers & Education 33, 2–3, 1999.
32
Learning Procedural Knowledge from
User Solutions to Ill-Defined Tasks in
a Simulated Robotic Manipulator
Contents
32.1 Introduction......................................................................................................................... 451
32.2 The CanadarmTutor Tutoring System.............................................................................. 452
32.3 A Domain Knowledge Discovery Approach for the Acquisition of
Domain Expertise............................................................................................................... 453
32.3.1 Step 1: Recording Users’ Plans..............................................................................454
32.3.2 Step 2: Mining a Partial Task Model from Users’ Plans.................................... 455
32.3.3 Step 3: Exploiting the Partial Task Model to Provide Relevant
Tutoring Services.................................................................................................... 457
32.3.3.1 Assessing the Profile of a Learner......................................................... 458
32.3.3.2 Guiding the Learner................................................................................ 458
32.3.3.3 Letting Learners Explore Different Ways of Solving Problems........ 458
32.4 Evaluating the New Version of CanadarmTutor............................................................ 459
32.5 Related Work....................................................................................................................... 460
32.5.1 Other Automatic or Semiautomatic Approaches for Learning
Domain Knowledge in ITS.................................................................................... 460
32.5.2 Other Applications of Sequential Pattern Mining in E-Learning................... 462
32.6 Conclusion........................................................................................................................... 462
Acknowledgments.......................................................................................................................463
References......................................................................................................................................463
32.1 Introduction
Domain experts should provide relevant domain knowledge to an intelligent tutoring sys-
tem (ITS) so that it can assist a learner during problem-solving activities. There are three
main approaches for providing such knowledge. The first one is cognitive task analysis that
aims at producing effective problem spaces or task models by observing expert and novice
users [1] to capture different ways of solving problems. However, cognitive task analysis
is a very time-consuming process [1] and it is not always possible to define a complete or
partial task model, in particular when a problem is ill-structured. According to Simon [2],
an ill-structured problem is one that is complex, with indefinite starting points, multiple
and arguable solutions, or unclear strategies for finding solutions. Domains that include
such problems and in which tutoring targets the development of problem-solving skills
are said to be ill-defined (within the meaning of Ashley and coworkers [3]). Constraint-
based modeling (CBM) was proposed as an alternative [4]. It consists of specifying sets of
constraints on what is a correct behavior, instead of providing a complete task description.
Though this approach was shown to be effective for some ill-defined domains, a domain
expert has to design and select the constraints carefully. The third approach consists of
integrating an expert system into an ITS [8]. However, developing an expert system
can be difficult and costly, especially for ill-defined domains.
Contrary to these approaches, where domain experts have to provide the domain
knowledge, a promising approach is to use knowledge-discovery techniques to automati-
cally learn a problem space from logged user interactions in an ITS, and to use this knowl-
edge base to offer tutoring services. A few efforts have been made in this direction in
the field of ITS [5,6,17,21–23], but they have all been applied to well-defined domains. As
an alternative, in this chapter, we propose an approach that is specially designed for ill-
defined domains. The approach has the advantage of taking into account learner profiles
and does not require any form of background knowledge. The reader should note that a
detailed comparison with related work is provided in Section 32.5.
This chapter is organized as follows. First, we describe the CanadarmTutor tutoring
system and previous attempts for incorporating domain expertise into it. Next, we pres-
ent our approach for extracting partial problem spaces from logged user interactions and
describe how it enables CanadarmTutor to provide more realistic tutoring services. Finally,
we compare our approach with related work, discuss avenues of research, and present
conclusions.
FIGURE 32.1
The CanadarmTutor User Interface.
task-analysis approach [9]. Although we described high-level rules such as how to select cam-
eras and set their parameters in the correct order, it was not possible to go into finer detail
to model how to rotate the arm joint(s) to attain a goal configuration. The reason is that for
a given robot manipulation problem, there is a huge number of possibilities for moving the
robot to a goal configuration and because one must also consider the safety of the maneuvers
and their ease, it is very difficult to define a “legal move generator” for generating the
moves that a human would execute. In fact, some joint manipulations are preferable to oth-
ers, depending on several criteria that are hard to formalize such as the view of the arm given
by the chosen cameras, the relative position of obstacles to the arm, the arm configuration
(e.g., avoiding singularities), and the familiarity of the user with certain joint manipulations
over others. It is, thus, not possible to define a complete and explicit task model for this task.
Hence, CanadarmTutor operates in an ill-defined domain as defined by Simon.
The CBM approach [4] may represent a good alternative to the cognitive task analysis
approach. However, in the CanadarmTutor context, it would be very difficult for domain
experts to describe a set of relevance and satisfaction conditions that apply to all situations.
In fact, there would be too many conditions and still a large number of solutions would
fit the conditions for each problem. Moreover, the CBM approach is useful for validating
solutions. But it cannot support tutoring services such as suggesting next problem-solving
steps to learners, which is a required feature in CanadarmTutor.
of a procedural learning domain. Our approach consists of the following steps. Each time
a user attempts to solve a problem, the tutoring system records the attempt as a plan in a
database of user solutions. Then humans can display, edit, or annotate plans with con-
textual information (e.g., required human skills to execute the plan and expertise level of
those skills). Thereafter, a data-mining algorithm is applied to extract a partial task model
from the user plans. The resulting task model can be displayed, edited, and annotated
before being taken as input by the tutoring system to provide tutoring services. This whole
process of recording plans to extract a problem space can be performed periodically to
make the system constantly improve its domain knowledge. Moreover, staff members can
intervene in the process or it can be fully automated. The next sections present in detail
each step of this process and how they were applied in CanadarmTutor to support relevant
tutoring services.
TABLE 32.1
An MD-Database Containing Six User Solutions
Dimensions
ID Solution State Expertise Skill_1 Skill_2 Skill_3 Sequence
S1 Successful Novice Yes Yes Yes 〈(0,a), (1,b c)〉
S2 Successful Expert No Yes No 〈(0,d)〉
S3 Buggy Novice Yes Yes Yes 〈(0,a), (1,b c)〉
S4 Buggy Intermediate No Yes Yes 〈(0,a), (1,c), (2,d)〉
S5 Successful Novice No No Yes 〈(0,d), (1,c)〉
S6 Successful Expert No No Yes 〈(0,c), (1,d)〉
TABLE 32.2
Some Frequent Patterns Extracted from the Dataset of Table 32.1 with minsup = 33%
Dimensions
ID Solution State Expertise Skill_1 Skill_2 Skill_3 Sequence Supp.
P1 * Novice Yes Yes Yes 〈(0,a)〉 33%
P2 * * * Yes Yes 〈(0,a)〉 50%
P3 * Novice Yes Yes Yes 〈(0,a), (1,b)〉 33%
P4 Successful * No * * 〈(0,d)〉 50%
P5 Successful Novice * * Yes 〈(0,c)〉 33%
P6 Successful Expert No * No 〈(0,d)〉 33%
include age, educational background, learning styles, cognitive traits, and emotional state,
assuming that the data is available. In CanadarmTutor, we had 10 skills, and used the
“solution state” and “expertise level” dimensions to annotate sequences (see Section 32.4
for more details).
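For concreteness, the rows of Table 32.1 could be stored as plain records like the following. This is a minimal sketch; the field names and Python layout are assumptions made for illustration, not the system's actual schema.

# One record per recorded user solution, mirroring Table 32.1.
# Dimensional values annotate the action sequence; the sequence itself is a
# list of (timestamp, set of actions) pairs.
md_database = [
    {"id": "S1", "solution_state": "Successful", "expertise": "Novice",
     "skills": {"Skill_1": True, "Skill_2": True, "Skill_3": True},
     "sequence": [(0, {"a"}), (1, {"b", "c"})]},
    {"id": "S2", "solution_state": "Successful", "expertise": "Expert",
     "skills": {"Skill_1": False, "Skill_2": True, "Skill_3": False},
     "sequence": [(0, {"d"})]},
    # ... S3 to S6 follow the same layout
]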
of patterns [24]. For this work, we developed a custom SPM algorithm [15] that combines
several features from other algorithms such as accepting time constraints [10], processing
databases with dimensional information [11], and mining a compact representation of all
patterns [14,16], and that also adds some original features such as accepting symbols with
parameter values. We have built this algorithm to address the type of data to be recorded
in a tutoring system offering procedural exercises such as CanadarmTutor. The main idea
of that algorithm will be presented next. For a technical description of the algorithm, the
reader can refer to [15].
The algorithm takes as input an MD-Database and some parameters and finds all
MD-Sequences occurring frequently in the MD-Database. Here, sequences are action
sequences (not necessarily contiguous in time) with timestamps as defined in Section
32.3.1. A sequence sa = 〈(ta1, A1), (ta2, A2), …, (tan, An)〉 is said to be contained in another
sequence sb = 〈(tb1, B1), (tb2, B2), …, (tbm, Bm)〉 if there exist integers 1 ≤ k1 < k2 < … < kn ≤ m
such that A1 ⊆ Bk1, A2 ⊆ Bk2, …, An ⊆ Bkn, and tbkj − tbk1 is equal to takj − tak1 for each j ∈
1 … n (recall that in this work, timestamps of successive events are successive integers, e.g.,
0, 1, 2, …). Similarly for MD-Patterns, an MD-Pattern Px = dx1, dx2, …, dxp is said to be contained
in another MD-Pattern Py = dy1, dy2, …, dyp if, for each i ∈ 1 … p, dyi = “*” or dyi = dxi
[11]. The relative support of a sequence (or MD-Pattern) in a sequence database D is defined
as the percentage of sequences (or MD-Patterns) from D that contain it. The problem of
mining frequent MD-Sequences is, given an MD-Database D and a support threshold
minsup, to find all MD-Sequences whose support is greater than or equal to minsup.
As an example, Table 32.2 shows some patterns that can be extracted
from the MD-Database of Table 32.1, with a minsup of two sequences (33%). Consider pat-
tern P3. This pattern represents doing action b one time unit (immediately) after action a.
The pattern P3 appears in MD-Sequences S1 and S3. It thus has a support of 33%, or two
MD-Sequences. Because this support is greater than or equal to minsup, P3 is frequent. Moreover,
the annotations for P3 tell us that this pattern was performed by novice users who possess
skills “Skill_1,” “Skill_2,” and “Skill_3,” and that P3 was found in plan(s) that failed, as
well as plan(s) that succeeded.
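To make the containment definition above concrete, the following Python sketch checks whether a candidate pattern sequence is contained in a recorded sequence. The helper name and data layout are assumptions for illustration, not the authors' implementation, and the sketch ignores the dimensional part of MD-Sequences.

from itertools import combinations

def sequence_contains(pattern, sequence):
    """Return True if `pattern` is contained in `sequence`.

    Both arguments are lists of (timestamp, itemset) pairs, e.g.
    [(0, {'a'}), (1, {'b', 'c'})].  Per the definition above, the pattern's
    itemsets must map, in order, onto itemsets of the sequence so that each
    pattern itemset is a subset of the matched itemset and the relative time
    offsets from the first matched element are preserved.
    """
    n, m = len(pattern), len(sequence)
    for ks in combinations(range(m), n):          # candidate k1 < k2 < ... < kn
        base_p, base_s = pattern[0][0], sequence[ks[0]][0]
        if all(pattern[j][1] <= sequence[ks[j]][1] and
               sequence[ks[j]][0] - base_s == pattern[j][0] - base_p
               for j in range(n)):
            return True
    return False

# Pattern P3 = <(0,a), (1,b)> is contained in S1 = <(0,a), (1,b c)> but not in S4.
S1 = [(0, {'a'}), (1, {'b', 'c'})]
S4 = [(0, {'a'}), (1, {'c'}), (2, {'d'})]
P3 = [(0, {'a'}), (1, {'b'})]
print(sequence_contains(P3, S1), sequence_contains(P3, S4))   # True False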
In addition, as in [10], we have incorporated in our algorithm the possibility of specify-
ing time constraints on mined sequences such as minimum and maximum time inter-
vals required between the head and tail of a sequence and minimum and maximum time
intervals required between two successive events of a sequence. In CanadarmTutor, we
mine only sequences of length two or greater, as shorter sequences would not be useful
in a tutoring context. Furthermore, we chose to mine sequences with a maximum time
interval between two successive events of two time units. The benefit of accepting a gap
of two is that it eliminates some “noisy” (non-frequent) learners’ actions, but at the same
time, it does not allow a larger gap size that could make patterns less useful for tracking
a learner’s actions.
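In practice these constraints are enforced inside the SPM algorithm itself [10,15]; the following sketch only illustrates, with assumed parameter names, what the length-two and maximum-gap-of-two restrictions amount to when checked on a candidate pattern.

def satisfies_constraints(pattern, min_length=2, max_gap=2):
    """Check the constraints used in CanadarmTutor on a candidate pattern.

    `pattern` is a list of (timestamp, itemset) pairs.  Only patterns of
    length two or more whose successive events are at most `max_gap` time
    units apart are kept; head/tail (whole-pattern) constraints could be
    added in the same way.
    """
    if len(pattern) < min_length:
        return False
    times = [t for t, _ in pattern]
    return all(t2 - t1 <= max_gap for t1, t2 in zip(times, times[1:]))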
Another important consideration is that when applying SPM, there can be many redun-
dant frequent sequences found. For example, in Table 32.2, the pattern “*, novice, yes, yes, yes
〈(0,a)〉” is redundant as it is included in the pattern “*, novice, yes, yes, yes 〈(0,a), (1,b)〉” and it has
exactly the same support. To eliminate this type of redundancy, we have adapted our algo-
rithm to mine only frequent closed sequences. “Closed sequences” [14,16,24] are sequences that
are not contained in another sequence having the same support. Mining frequent closed
sequences has the advantage of greatly reducing the number of patterns found, without
information loss (the set of closed frequent sequences allows reconstituting the set of all
frequent sequences and their support) [14]. To mine only frequent closed sequences, our
SPM algorithm was extended based on [14] and [16] to mine closed MD-Sequences (see [15]).
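As an illustration of the closedness criterion, the sketch below post-filters an already-mined list of patterns, reusing the sequence_contains helper from the earlier sketch. It is a brute-force filter over the sequence part only; unlike the algorithms extended from [14,16], it does not avoid generating non-closed patterns in the first place.

def keep_closed(patterns):
    """Keep only closed patterns from a list of (sequence, support) pairs.

    A pattern is closed if it is not contained in a different pattern having
    the same support.  Dimensional values are ignored here for brevity.
    """
    return [(p, s) for i, (p, s) in enumerate(patterns)
            if not any(i != j and s == s2 and p != q and sequence_contains(p, q)
                       for j, (q, s2) in enumerate(patterns))]

# <(0,a)> (support 33%) is dropped because <(0,a),(1,b)> has the same support.
patterns = [([(0, {'a'})], 0.33), ([(0, {'a'}), (1, {'b'})], 0.33)]
print(keep_closed(patterns))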
Once patterns have been mined by our SPM algorithm, they form a partial problem
space that can be used by a tutoring agent to provide assistance to learners, as described
in the next section. However, we also provide a simple software program for displaying
patterns, editing them, or adding annotations.
RecognizePlan(Student_trace, Patterns)
  Result := Ø.
  FOR each pattern P of Patterns
    IF Student_trace is included in P
      Result := Result ∪ {P}.
  IF Result = Ø AND length(Student_trace) ≥ 2
    Remove last action of Student_trace.
    Result := RecognizePlan(Student_trace, Patterns).
  Return Result.
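For readers who prefer an executable form, here is a direct Python transcription of the pseudocode. It is a sketch under the assumptions that each pattern is represented by its sequence part only and that the sequence_contains helper sketched earlier serves as the inclusion test.

def recognize_plan(student_trace, patterns):
    """Return the mined patterns that the student's partial trace matches.

    `student_trace` is a list of (timestamp, itemset) pairs and `patterns`
    a list of pattern sequences for the current problem.  If no pattern
    matches, the last action is removed and the search is retried, as in
    the pseudocode above.
    """
    result = [p for p in patterns if sequence_contains(student_trace, p)]
    if not result and len(student_trace) >= 2:
        return recognize_plan(student_trace[:-1], patterns)
    return result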
After many users perform the same exercise, CanadarmTutor extracts sequential patterns
from (1) sequences of problem states visited and (2) sequences of actions performed for
going from a problem state to another. To take advantage of the added notion of problem
states, we modified RecognizePlan so that at every moment, only the patterns performed in
the current problem state are considered. To do so, every time the problem state changes,
RecognizePlan will be called with the set of patterns associated with the new problem state.
Moreover, at a coarser grain level, tracking the problem states visited by the learner is
also achieved by calling RecognizePlan. This allows connecting patterns for different prob-
lem states. We describe next the main tutoring services that a tutoring agent can provide
based on the plan-recognition algorithm.
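One way to organize the per-problem-state dispatch just described is sketched below. The class name and the mapping from problem states to their mined patterns are assumptions made for illustration, and the sketch builds on the recognize_plan function above.

class PlanTracker:
    """Track a learner by calling recognize_plan on the current state's patterns."""

    def __init__(self, patterns_by_state):
        self.patterns_by_state = patterns_by_state   # problem state -> mined patterns
        self.current_state = None
        self.trace = []

    def on_action(self, problem_state, timed_action):
        """Record one learner action and return the matching patterns."""
        if problem_state != self.current_state:
            # the problem state changed: restart tracking against its patterns
            self.current_state = problem_state
            self.trace = []
        self.trace.append(timed_action)
        return recognize_plan(self.trace, self.patterns_by_state.get(problem_state, []))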
provides sorting and filtering functions (e.g., to display only patterns leading to success).
However, the learner could be assisted in this exploration by using an interactive dialog
with the system, which could prompt them on their goals and help them go through the
patterns to achieve these goals. This tutoring service could be used when the tutoring sys-
tem wants to prepare students before involving them in real problem-solving situations.
FIGURE 32.2
(a) The two scenarios. (b) A hint offered by CanadarmTutor.
It should be noted that although the sequential pattern algorithm was applied only once
in this experiment, after recording each learner’s plan, it would be possible to make
CanadarmTutor apply it more often, in order to continuously update its knowledge base
while interacting with learners.
and problems and solutions, both annotated with concept instances from the ontology.
CAS then extracts constraints directly from the restrictions that are defined in the ontol-
ogy. CAS was successfully applied for the domains of fraction addition and entity-relation
diagram modeling. But the success of CAS for a domain depends directly on whether it
is possible to build an ontology that is appropriate and detailed enough, and this is not
necessarily the case for all domains. In fact, ontology modeling can be difficult. Moreover,
as previously mentioned, constraint-based tutors can validate solutions. But they cannot
suggest next problem-solving steps to learners.
A last method that we mention is the work of Barnes and Stamper [17], which was applied
for the well-defined domain of logic proofs. The approach of Barnes and Stamper consists
of building a Markov decision process containing learner solutions for a problem. This is
a graph acquired from a corpus of student problem-solving traces where each state repre-
sents a correct or erroneous state and each link is an action to go from one state to another.
Then, given a state, an optimal path can be calculated to reach a goal state according to
desiderata such as popularity, lowest probability of errors, or fewest actions.
An optimal path found in this way is then used to suggest to a learner the next actions
to perform. This approach does not require providing any kind of domain knowledge or
extra information during demonstrations. However, since all traces are stored completely
in a graph, the approach seems limited to domains where the number of possibilities is not
large. Moreover, their approach does not support more elaborate tutoring services such
as estimating the profile of a learner by looking at the actions that the learner applies (e.g.,
expertise level), and hints are not suggested based on the estimated profile of a learner.
We believe this to be a major limitation of their approach, as in many cases, an ITS should
not consider the “optimal solution” (as previously defined) as being the best solution for
a learner. Instead an ITS should select successful solutions that are adapted to a learner
profile, to make the learner progress along the continuum from novice to expert.
In summary, in addition to some specific limitations, all systems mentioned in this sec-
tion have one or more of the following limitations: they (1) require defining a body of
background knowledge [6,21–23], (2) have been demonstrated for well-defined domains
[5,6,17,21–23], (3) rely on the strong assumption that tasks can be modeled as production
rules [6,21,23], (4) do not take into account learner profiles [5,6,17,21–23], (5) learn knowledge
that is problem specific [5,17], and (6) require demonstrators to provide extra information
during demonstrations such as their intentions or labels for elements of their solutions [6,22].
Our approach is clearly different from these approaches as it does not possess any of
these limitations except for generating task models that are problem specific. We believe
that this limitation is an acceptable trade-off, as in many domains a particular collection of
exercises can be set up to be administered to many students. Also, it is important to note
that our approach allows manual annotation of sequences if needed.
The approach of Barnes et al. shares some similarities with our approach as it needs to
be applied for each problem, and authors are not required to provide any kind of domain
knowledge. But the approach of Barnes et al. also differs from ours in several ways. The
first important difference is that our approach extracts partial problem spaces from user
solutions. Therefore our framework ignores parts of learner solutions that are not frequent.
This strategy of extracting similarities in learners’ solutions allows coping with domains
such as the manipulation of Canadarm2 where the number of possibilities is very large,
and user solutions do not share many actions. In fact, our framework builds abstractions of
learners’ solutions, where the frequency threshold minsup controls what will be excluded
from these abstractions. A second important difference is that the abstractions created by
our framework are generalizations as they consist of subsequences appearing in several
learner solutions. This property of problem spaces produced by our framework is very
useful as it allows finding patterns that are common to several profiles of learners or con-
texts (e.g., patterns common to expert users who succeed in solving a problem, or common
to users possessing or lacking one or more skills). Conversely, the previously mentioned
approaches do not take into account the profile of learners who recorded solutions.
32.6 Conclusion
In this chapter, we have presented an approach for domain knowledge acquisition in ITS,
and shown that it can be a plausible alternative to classic domain knowledge acquisi-
tion approaches, particularly for a procedural and ill-defined domain where classical
approaches fail. For discovering domain knowledge, we proposed to use an SPM algo-
rithm that we designed for addressing the type of data recorded in tutoring systems such
as CanadarmTutor. Since the proposed data mining framework and its inputs and outputs
are fairly domain independent, it can be potentially applied to other ill-defined procedural
domains where solutions to problems can be expressed as sequences of actions as defined
previously. With the case study of CanadarmTutor, we described how the approach can
support relevant tutoring services. We have evaluated the capability of the new version
of CanadarmTutor to exploit the learned knowledge to provide tutoring services. Results
showed an improvement over the previous version of CanadarmTutor in terms of tracking
learners’ behavior and providing hints. In future work, we will perform further experi-
ments to measure empirically how the tutoring services influence the learning of students.
Because the problem spaces extracted with our approach are incomplete, we suggest
using our approach jointly with other domain knowledge acquisition approaches when
possible. In CanadarmTutor, we do so by making CanadarmTutor use the path-planner
for providing hints when no patterns are available. Also, we are currently working on
combining our data-mining approach with the rule-based model that we implemented
in another version of CanadarmTutor using the cognitive task analysis approach [9]. In
particular, we are exploring the possibility of using skills from the rule-based model to
automatically annotate recorded sequences used by the data-mining approach presented
in this chapter.
We also plan to use association rule mining as in a previous version of our approach,
to find associations between sequential patterns over a whole problem-solving exercise
[18]. Mining association rules could improve the effectiveness of the tutoring services, as
it is complementary to dividing the problem into problem states. For example, if a learner
followed a pattern p, an association rule could indicate that the learner has a higher prob-
ability of applying another pattern q later during the exercise than some other pattern r
that is available for the same problem state.
Finally, we are interested in mining other types of temporal patterns in ITSs. We have
recently published work [19] that instead of mining sequential patterns from the behavior
of learners, mines frequent sequences from the behavior of a tutoring agent. The tutoring
agent then reuses sequences of tutorial interventions that were successful with learners.
This second research project is also based on the same algorithm described here.
Acknowledgments
Our thanks go to FQRNT and NSERC for their logistic and financial support. Moreover,
the authors would like to thank Severin Vigot, Mikael Watrelot, and Lionel Tchamfong
for integrating the framework in CanadarmTutor, the other members of the GDAC/
PLANIART teams who participated in the development of CanadarmTutor, and the anon-
ymous reviewers who provided many comments for improving this chapter.
References
1. Aleven, V., McLaren, B. M., Sewall, J., and Koedinger, K. 2006. The Cognitive Tutor Authoring
Tools (CTAT): Preliminary evaluation of efficiency gains. In Proceedings of the Eighth International
Conference on Intelligent Tutoring Systems, Jhongli, Taiwan, June 26–30, 2006, pp. 61–70.
2. Simon, H. A. 1978. Information-processing theory of human problem solving. In Handbook
of Learning and Cognitive Processes, Vol. 5. Human Information, W.K. Estes (Ed.), pp. 271–295.
Hillsdale, NJ: John Wiley & Sons, Inc.
3. Lynch, C., Ashley, K., Aleven, V., and Pinkwart, N. 2006. Defining ill-defined domains: A
literature survey. In Proceedings of the Intelligent Tutoring Systems for Ill-Defined Domains Workshop,
Jhongli, Taiwan, June 27, 2006, pp. 1–10.
4. Mitrovic, A., Mayo, M., Suraweera, P., and Martin, B. 2001. Constraint-based tutors: A suc-
cess story. In Proceedings of the 14th Industrial & Engineering Application of Artificial Intelligence &
Expert Systems, Budapest, Hungary, June 4–7, 2001, pp. 931–940.
5. McLaren, B., Koedinger, K.R., Schneider, M., Harrer, A., and Bollen, L. 2004. Bootstrapping nov-
ice data: Semi-automated tutor authoring using student log files. In Proceedings of the Workshop
on Analyzing Student-Tutor Logs to Improve Educational Outcomes, Maceiò, Alagoas, Brazil,
August 30, 2004, pp. 1–13.
6. Jarvis, M., Nuzzo-Jones, G., and Heffernan, N.T. 2004. Applying machine learning techniques to
rule generation in intelligent tutoring systems. In Proceedings of Seventh International Conference
on Intelligent Tutoring Systems, Maceiò, Brazil, August 30–September 3, 2004, pp. 541–553.
7. Kristofic, A. and Bielikova, M. 2005. Improving adaptation in web-based educational hyperme-
dia by means of knowledge discovery. In Proceedings of the 16th ACM Conference on Hypertext
and Hypermedia, Bratislava, Slovakia, September 6–10, 2005, pp. 184–192.
8. Kabanza, F., Nkambou, R., and Belghith, K. 2005. Path-planning for autonomous training on
robot manipulators in space. In Proceedings of International Joint Conference on Artificial Intelligence
2005, Edinburgh, U.K., July 30–August 5, 2005, pp. 35–38.
9. Fournier-Viger, P., Nkambou, R., and Mayers, A. 2008. Evaluating spatial representations and
skills in a simulator-based tutoring system. IEEE Transactions on Learning Technologies, 1(1):
63–74.
10. Hirate, Y. and Yamana, H. 2006. Generalized sequential pattern mining with item intervals,
Journal of Computers, 1(3): 51–60.
11. Pinto, H. et al. 2001. Multi-dimensional sequential pattern mining. In Proceedings of the 10th
International Conference on Information and Knowledge Management, Atlanta, GA, November 5–10,
2001, pp. 81–88.
12. Han, J. and Kamber, M. 2000. Data Mining: Concepts and Techniques. San Francisco, CA: Morgan
Kaufmann Publisher.
13. Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proceedings of the International
Conference on Data Engineering, Taipei, Taiwan, March 6–10, 1995, pp. 3–14.
14. Wang, J., Han, J., and Li, C. 2007. Frequent closed sequence mining without candidate mainte-
nance, IEEE Transactions on Knowledge and Data Engineering, 19(8):1042–1056.
15. Fournier-Viger, P., Nkambou, R., and Mephu Nguifo, E. 2008. A knowledge discovery frame-
work for learning task models from user interactions in intelligent tutoring systems. In
Proceedings of the Sixth Mexican International Conference on Artificial Intelligence, Atizapán de
Zaragoza, Mexico, October 27–31, 2008, pp. 765–778.
16. Songram, P., Boonjing, V., and Intakosum, S. 2006. Closed multidimensional sequential pat-
tern mining. In Proceedings of the Third International Conference Information Technology: New
Generations, Las Vegas, NV, April 10–12, 2006, pp. 512–517.
17. Barnes, T. and Stamper, J. 2008. Toward automatic hint generation for logic proof tutoring using
historical student data. In Proceedings of the Ninth International Conference on Intelligent Tutoring
Systems, Montreal, Canada, June 23–27, 2008, pp. 373–382.
18. Nkambou, R., Mephu Nguifo, E., and Fournier-Viger, P. 2008. Using knowledge discovery tech-
niques to support tutoring in an ill-defined domain. In Proceedings of the Ninth International
Conference on Intelligent Tutoring Systems, Montreal, Canada, June 23–27, 2008, pp. 395–405.
19. Faghihi, U., Fournier-Viger, P., and Nkambou, R. 2009. How emotional mechanism helps
episodic learning in a cognitive agent. In Proceedings of IEEE Symposium on Intelligent Agents,
Nashville, TN, March 30–April 2, 2009, pp. 23–30.
20. Perera, D., Kay, J., Koprinska, I., Yacef, K., and Zaiane, O. 2008. Clustering and sequential pat-
tern mining of online collaborative learning data. IEEE Transactions on Knowledge and Data
Engineering, 21(6): 759–772.
21. Matsuda, N., Cohen, W., Sewall, J., Lacerda, G., and Koedinger, K. 2007. Performance with
SimStudent: Learning cognitive skills from observation. In Proceedings of Artificial Intelligence in
Education 2007, Los Angeles, CA, July 9–13, 2007, pp. 467–478.
22. Suraweera, P., Mitrovic, A., and Martin, B. 2007. Constraint authoring system: An empirical
evaluation. In Proceedings of Artificial Intelligence in Education 2007, Los Angeles, CA, July 9–13,
2007, pp. 467–478.
23. Blessing, S. B. 1997. A programming by demonstration authoring tool for model-tracing tutors.
International Journal of Artificial Intelligence in Education, 8: 233–261.
24. Yuan, D., Lee, K., Cheng, H., Krishna, G., Li, Z., Ma, X., Zhou, Y., and Han, J. 2008. CISpan:
Comprehensive incremental mining algorithms of closed sequential patterns for multi-
versional software mining. In Proceedings of the Eighth SIAM International Conference on Data
Mining, Atlanta, GA, April 24–26, pp. 84–95.
25. Antunes, C. 2008. Acquiring background knowledge for intelligent tutoring systems. In
Proceedings of the Second International Conference on Educational Data Mining, Montreal, Canada,
June 20–21, pp. 18–27.
26. Romero, C., Gutiérrez, S., Freire, M., and Ventura, S. 2008. Mining and visualizing visited
trails in web-based educational systems. In Proceedings of the Second International Conference on
Educational Data Mining, Montreal, Canada, June 20–21, pp. 182–186.
27. Su, J.-M., Tseng, S.-S., Wang, W., Weng, J.-F., Yang, J. T. D., and Tsai, W.-N. 2006. Learning port-
folio analysis and mining for SCORM compliant environment. Educational Technology & Society,
9(1): 262–275.
33
Using Markov Decision Processes for
Automatic Hint Generation
Contents
33.1 Introduction......................................................................................................................... 467
33.2 Background.......................................................................................................................... 468
33.3 Creating an MDP-Tutor...................................................................................................... 470
33.3.1 Constructing the MDP from Data........................................................................ 471
33.3.2 The MDP Hint Generator...................................................................................... 473
33.4 Feasibility Studies............................................................................................................... 474
33.5 Case Study: The Deep Thought Logic MDP-Tutor......................................................... 476
33.6 Conclusions.......................................................................................................................... 478
References...................................................................................................................................... 478
33.1 Introduction
In this chapter, we present our novel application of Markov decision processes (MDPs)
for building student knowledge models from collected student data. These models can
be integrated into existing computer-aided instructional tools to automatically generate
context-specific hints. The MDP method is straightforward to implement in many edu-
cational domains where computer-aided instruction exists and student log data is being
generated, but no intelligent tutors have been created. In this chapter, we demonstrate the
creation of an MDP-Tutor for the logic domain, but the methods can be readily translated
to other procedural problem-solving domains. We first demonstrate how to create an MDP
to represent finding optimal solutions. We then show how to use this MDP to create a hint
generator, and illustrate a feasibility study to determine how often the generator can pro-
vide hints. This feasibility technique is a valuable contribution that can be used to make
decisions about the appropriateness of MDPs for hint generation in a particular problem-
solving domain. The chapter concludes with a discussion of the future of using MDPs
and other data-derived models for learning about and supporting student learning and
problem solving.
33.2 Background
Our goal is to create data-driven, domain-independent techniques for creating cognitive
models that can be used to generate student feedback, guidance, and hints, and help edu-
cators understand students’ learning. Through adaptation to individual learners, intelli-
gent tutoring systems (ITSs), such as those built on ACT-R, can have significant effects on
learning [1]. ACT-R Theory is a cognitive architecture that has been successfully applied
in the creation of cognitive tutors, and contains a procedural component that uses pro-
duction rules to model how students move between problem states [1]. Through a pro-
cess called model tracing, cognitive tutors find a sequence of rules that produce the
actions a student has taken. This allows for individualized help, tracks student mastery
(using the correctness of the productions being applied), and can provide feedback on
recognized errors.
In general, most of these ITSs include models that are domain or application specific,
and require significant investment of time and effort to build [22]. A variety of approaches
have been used to reduce the development time for ITSs, including ITS authoring tools
(such as ASSERT and CTAT), or building constraint-based student models instead of pro-
duction rule systems. ASSERT is an ITS authoring system that uses theory refinement to
learn student models from an existing knowledge base and student data [3]. Constraint-
based tutors, which look for violations of problem constraints, require less time to con-
struct and have been favorably compared to cognitive tutors, particularly for problems
that may not be heavily procedural [21]. However, constraint-based tutors can only pro-
vide condition violation feedback, not goal-oriented feedback that has been shown to be
more effective [28].
Some systems, including RIDES, DIAG, and CTAT use teacher-authored or demon-
strated examples to develop ITS production rules. In these example-based authoring tools,
domain experts work problems in what they predict to be frequent correct and incorrect
approaches, and then annotate the learned rules with appropriate hints and feedback.
RIDES is a “Tutor in a Box” system used to build training systems for military equipment
usage, while DIAG was built as an expert diagnostic system that generates context-specific
feedback for students [22]. CTAT has been used to develop example-based tutors for sub-
jects including genetics, Java, and truth tables [16]. CTAT researchers built SimStudent,
a machine learning agent that learns production rules by demonstration and is used to
simulate students [17]. CTAT has also been used with student data to build initial mod-
els for an ITS, in an approach called Bootstrapping Novice Data (BND) [19]. However,
expert example-based approaches cannot be easily generalized to learn from student data,
and considerable time must still be spent in identifying student approaches and creating
appropriate hints.
Machine learning has been used in a number of ways to improve tutoring systems, most
extensively in modeling students and in tutorial dialog. In the ADVISOR tutor, machine
learning was used to build student models that could predict the time students took to
solve arithmetic problems, and to adapt instruction to minimize this time while meeting
teacher-set instructional goals [8]. Jameson et al. similarly predict a user’s execution time
and errors based, in part, on the system’s delivery of instructions [15]. Mayo and Mitrovic
use a Bayesian network model to predict student performance, and a decision-theoretic
model to select appropriate tutorial actions in CAPIT [18]. The DT-Tutor models focus of
attention, affect, and student knowledge of tasks and domain rules to predict the topic and
correctness of a student’s next action, and uses this with decision theory to make decisions
about tutor actions [23]. All of these systems use probabilistic networks to model student
learning and behavior, and combine these models with decision-theoretic approaches to
make choices for the next tutorial actions.
Other applications use educational data mining techniques to discover patterns of
student behavior. Baker has used educational data mining on labeled tutor log data to dis-
cover detrimental gaming behaviors that students use to avoid learning in ITSs [4]. Soller uses hand-labeled
interactions to train a Hidden Markov Model classifier to distinguish effective interactions
in collaborative learning tasks from ineffective ones [26]. Amershi and Conati have applied
both unsupervised and supervised learning to student interaction and eye-tracking data
in exploratory learning environments to scaffold building models of effective student
learning behavior [2], and applied their educational data mining model to two different
learning environments. Other educational data mining approaches discover patterns by
clustering student test scores or solution data labeled by correctness, and present these
patterns to educators [14,24].
MDPs have been used specifically to model tutorial interactions, primarily for dialogue,
but also to model how to offer intelligent help in more general software applications. Chi,
VanLehn, and colleagues have built MDPs to model the dialogue acts of human tutors,
and compared them with their Greedy-RL tutor to investigate what tactics (such as elicit or
tell actions) are most effective for supporting learning [9]. Hui and colleagues use MDPs to
model decisions for a simulated help assistant, balancing the costs of interaction including
factors such as bloat, visual occlusion, information processing, and savings [13]. In con-
trast, we use MDPs as models of target student behavior and use them to generate
hints for students.
Deep Thought [10] and the Logic-ITA [20] are two CAI tools to support teaching and
learning of logic proofs. These tutors verify proof statements as a student enters them, and
provide feedback on clear errors. Logic-ITA provides students with performance feedback
upon proof completion, and considerable logging and teacher feedback [20]. Logic-ITA
student data was also mined to create hints that warned students when they were likely to
make mistakes using their current approach [20]. Another logic tutor called the Carnegie
Proof Lab uses an automated proof generator to provide contextual hints [25]. We have
augmented Deep Thought with a cognitive architecture derived using educational data
mining, that can provide students feedback to avoid error-prone solutions, find optimal
solutions, and inform students of other student approaches.
Similar to the goal of BND, we seek to use student data to directly create model trac-
ers for an ITS. However, instead of feeding student behavior data into CTAT to build a
production rule system, we directly build data-derived model tracers (DMTs) for use in
tutors. To do this, we have generated MDPs that represent all student approaches to a par-
ticular problem, and use these MDPs directly to generate hints. This method of automatic
hint generation using previous student data reduces the expert knowledge needed to
generate intelligent, context-dependent hints. The system is capable of continued refine-
ment as new data is provided. In this chapter, we demonstrate the feasibility of our hint-
generation approach through simulation experiments on existing student data and pilot
studies that show that the automatically generated hints help more students solve harder
proof problems.
Barnes and Stamper used visualization tools to explore how to generate hints based on
MDPs extracted from student data [5]. Croy, Barnes, and Stamper applied the technique to
visualize student proof approaches to allow teachers to identify problem areas for students
[10,11]. Our feasibility study for hint generation on simulated students indicated valuable
trade-offs between hint specificity and the amount of source data for the MDP [6].
Our method of automatic hint generation using previous student data reduces the
expert knowledge needed to generate intelligent, context-dependent hints and feedback,
and provides the basis for a more general DMT that learns as it is used. Although our
approach is currently only appropriate for model tracing and generating hints for specific
problems with existing prior data, we believe that machine learning applied to MDPs
may be used to create automated rules and hints for new problems in the same domain.
We illustrate our approach by applying MDPs to support student work in solving formal
logic proofs.
MDP-Tutors work best in situations where CAI already exists, problems consist of a series
of related steps and are most often solved using similar strategies, and log data is readily
available. Many CAI tools exist for math and science problems, which often have multiple
related steps but involve a limited number of problem-solving strategies. Hint generation
works best when errors and correct solutions are labeled or detected. However, the
MDP method can be used to analyze problems and build probabilistic MDP-Tutors with-
out this information [7].
We have created and describe here our MDP-Tutor for the logic proofs domain, but other
researchers have applied our techniques to generate student feedback as well. Fossati and
colleagues have used MDPs to model student behavior in the iList linked list tutor, and use
these models to choose tutorial feedback such as that given by human tutors [12]. The tutor
represents linked lists in a graphical form that students can manipulate in order to gain a
better understanding of the concepts without the need for large amounts of programming.
The iList tutor uses the MDP method to provide proactive feedback, indicating whether
the student’s current approach is likely to be successful.
Figure 33.1
Deep Thought Logic Tutor with problem 3.6 partially completed.
from the screen. We therefore save this as a state in the MDP, but as a “leaf” where it is
understood that the error state is entered and then the interface is restored to the prior
state. In other CAI tools where errors may persist in the problem-solving environment,
either because they are not detected or not removed, error states could be incorporated
into the MDP just as other states are. However, the retention of erroneous information in
a state must be carefully considered when generating hints about what a student should
do next. When the reward and expected reward values are assigned, as described below, it
is important to penalize only the states in which an error originally occurs, and not those
that follow it.
Once state representations are determined and student attempt graphs are constructed,
we combine these attempts into a single graph that represents all of the paths students
have taken in working a proof, by taking the union of all states and actions, and mapping
identical states to one another. We also create an artificial “goal” state and connect the last
state in each successful problem solution to this goal state. Next, value iteration is used to
find an optimal solution to the MDP. For the experiments in this work, we assign a large
reward to the goal state (100) and negative rewards (penalties) to incorrect states (−10)
and actions (−1), resulting in a bias toward short, correct solutions such as those an expert
might work. We have also proposed two other reward functions to deliver different types
of hints [4], including functions that
1. Find frequent, typical paths. To derive this policy, we assign high rewards to suc-
cessful paths that many students have taken. Vygotsky’s theory of the zone of
proximal development [29] states that students are able to learn new things that
are closest to what they already know. Presumably, frequent actions could be those
that more students feel fluent using, and may be more understandable than expert
solutions.
2. Find least error-prone paths. To derive this policy, we assign large penalties to
error states, which lower the overall rewards for paths near many errors.
After setting the chosen rewards, we apply the value iteration using a Bellman backup
to iteratively assign values V(s) to all states in the MDP until the values on the left- and
right-hand sides of Equation 33.1 converge [27]. V(s) corresponds to the expected reward
for following an optimal policy from state s. The equation for calculating values V(s) is
given in Equation 33.1, where R(s, a) is the reward for taking action a from state s, and Pa(s, s′)
is the probability that action a will take state s to state s′. In our case, Pa(s, s′) is calculated
by dividing the number of times action a is taken from state s to s′ by the total number of
actions leaving state s.
V(s) := max_a [ R(s, a) + Σ_{s′} Pa(s, s′) V(s′) ]    (33.1)
Once value iteration is complete, the values for each state indicate how close to the goal
a state is, while probabilities of each transition reveal the frequency of taking a certain
action in a certain state. The optimal solution in the MDP from any given state can then be
reached by choosing, at each state, the action leading to the next state with the highest value.
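A minimal sketch of the computation just described follows, assuming that transition counts have already been accumulated from the combined attempt graph and that R(s, a) folds in the action and error penalties. All names are illustrative assumptions, not the authors' implementation.

def estimate_transitions(counts):
    """counts[s][(a, s2)] = n  ->  transitions[s][a] = [(prob, s2), ...].

    Probabilities divide the number of times action a led from s to s2 by
    the total number of actions leaving s, as described above.
    """
    transitions = {}
    for s, edges in counts.items():
        total = sum(edges.values())
        per_action = {}
        for (a, s2), n in edges.items():
            per_action.setdefault(a, []).append((n / total, s2))
        transitions[s] = per_action
    return transitions

def value_iteration(transitions, rewards, terminal_values, tol=1e-4, max_iters=1000):
    """Apply Equation 33.1 until the state values converge.

    `rewards[(s, a)]` is R(s, a); `terminal_values` fixes the value of
    absorbing states such as the artificial goal state (e.g., 100).
    """
    V = {s: 0.0 for s in transitions}
    V.update(terminal_values)
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in transitions.items():
            best = max(rewards.get((s, a), 0.0) +
                       sum(p * V.get(s2, 0.0) for p, s2 in succ)
                       for a, succ in actions.items())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return V

def best_action(s, transitions, rewards, V):
    """Choose the action leading toward the highest-valued next state (the hint source)."""
    return max(transitions[s],
               key=lambda a: rewards.get((s, a), 0.0) +
                             sum(p * V.get(s2, 0.0) for p, s2 in transitions[s][a]))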
Table 33.1
Deep Thought Hint Sequence Template and Example Data
Hint | Hint Text (Source Data in Bold) | Hint Type
1 | Try to derive not N working forward | Indicate goal expression
2 | Highlight if not T then not N and not T to derive it | Indicate the premises to select
3 | Click on the rule Modus Ponens (MP) | Indicate the rule to use
4 | Highlight if not T then not N and not T and click on Modus Ponens (MP) to get not N | Bottom-out hint
Source: Barnes, T. J. et al., A pilot study on logic proof tutoring using hints generated from historical
student data, in Baker, R. S. J. d., Barnes, T., and Beck, J. E. (Eds.) Educational Data
Mining 2008: 1st International Conference on Educational Data Mining, Proceedings.
Montreal, Quebec, Canada, June 20–21, 2008, pp. 197–201. Available online at:
http://www.educationaldatamining.org/EDM2008/index.php?page=proceedings
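Once the highest-value next step is known, the four hint levels of Table 33.1 can be filled in mechanically. The sketch below assumes a step record carrying the premises, rule, and derived expression; the field names are hypothetical and the actual Deep Thought templates may differ.

def hint_sequence(step):
    """Instantiate the Table 33.1 template, from least to most specific."""
    premises = " and ".join(step["premises"])
    rule, result = step["rule"], step["result"]
    return [
        f"Try to derive {result} working forward",                      # goal expression
        f"Highlight {premises} to derive it",                           # premises to select
        f"Click on the rule {rule}",                                    # rule to use
        f"Highlight {premises} and click on {rule} to get {result}",    # bottom-out hint
    ]

# Reproduces the example hints of Table 33.1:
print(hint_sequence({"premises": ["if not T then not N", "not T"],
                     "rule": "Modus Ponens (MP)", "result": "not N"}))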
Table 33.2
Comparison of % Move Matches Across # Source Semesters
and Matching Techniques
Matching 1-sem. MDPs 2-sem. MDPs 3-sem. MDPs
Ordered 72.79% 79.57% 82.32%
Unordered 79.62% 85.22% 87.26%
Ordered-1 79.88% 87.84% 91.57%
Unordered-1 85.00% 91.50% 93.96%
Source: Barnes, T. and Stamper, J., J. Educ. Technol. Soc., Special Issue
on Intelligent Tutoring Systems, 13(1), 3, February 2010. With
permission.
Table 33.3
Algorithm for One Trial of the Cold Start Experiment to Test
How Quickly Hints Become Available
1. Let Test = {all 523 student attempts}.
2. Randomly choose and remove the next attempt a from the Test set.
3. Add a’s states and recalculate the MDP.
4. Randomly choose and remove the next attempt b from the Test set.
5. Compute the number of matches between b and MDP.
6. If Test is nonempty, then let a := b and go to step 3. Otherwise, stop.
The second study was a simulation of creating MDPs incrementally as students work
proofs, and calculating the probability of being able to generate hints as new attempts are
added to the MDP. This experiment explores how quickly such an MDP is able to provide
hints to new students, or in other words, how long it takes to solve the cold start problem.
For one trial, the method is given in Table 33.3.
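A sketch of one trial of this experiment is given below, with build_mdp and count_matches standing in, as assumptions, for the MDP construction and move-matching functions described earlier.

import random

def cold_start_trial(attempts, build_mdp, count_matches):
    """Run one trial of the Table 33.3 cold-start experiment.

    Attempts are consumed in random order; each new attempt is first scored
    against the MDP built from the attempts seen so far, then added to it.
    Returns the per-attempt match rates, whose average over many trials
    yields curves like those in Figure 33.2.
    """
    test = list(attempts)
    random.shuffle(test)
    seen, match_rates = [], []
    a = test.pop()                    # step 2: first attempt seeds the MDP
    while test:
        seen.append(a)
        mdp = build_mdp(seen)         # step 3: recalculate the MDP
        b = test.pop()                # step 4: next attempt to evaluate
        match_rates.append(count_matches(b, mdp))   # step 5
        a = b                         # step 6: b joins the MDP next round
    return match_rates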
For this experiment, we used the ordered and unordered matching functions, and
plotted the resulting average matches over 100,000 trials. As shown in Figure 33.2, unor-
dered matches have a higher number of matches, but for both ordered and unordered
matching functions, the number of move matches rises quickly, and can be fit using power
functions.
Table 33.4 lists the number of attempts needed in the MDP versus target hint percent-
ages. For the unordered matching function, the 50% threshold is reached at just eight
Figure 33.2
Percentage of moves with hints available as attempts are added to the MDP. (From Barnes, T. and Stamper, J.,
J. Educ. Technol. Soc., Special Issue on Intelligent Tutoring Systems, 13(1), 3, February 2010. With permission.)
Table 33.4
Number of Attempts Needed to Achieve Threshold % Hints Levels
50% 55% 60% 65% 70% 75% 80% 85% 90%
Unordered 8 11 14 20 30 46 80 154 360
Ordered 11 15 22 33 55 85 162 362 ?
Source: Barnes, T. and Stamper, J., J. Educ. Technol. Soc., Special Issue on Intelligent Tutoring
Systems, 13(1), 3, February 2010. With permission.
student attempts and the 75% threshold at 49 attempts. For ordered matching, 50% occurs
on attempt 11 and 75% on attempt 88. These data are encouraging, suggesting that instruc-
tors using our MDP hint generator could seed the data to provide hint generation for new
problems. By allowing the instructor to enter as few as 8–11 example solutions to a prob-
lem, the method could automatically generate hints for 50% of student moves.
Together, the two feasibility studies presented here provided us with evidence that using
an MDP to automatically generate hints could be used with just one semester of data for a
particular problem. The quality of the hints is assured by our collaborative construction of
hints with domain experts, but as described in the next section, we evaluated the effects
of our automatically generated hints with students using the Deep Thought MDP-Tutor.
Table 33.5
Problem Descriptions for Deep Thought Problems
Problem 3.2 3.5 3.6 3.8
Attempts 16 22 26 25
Average length 11.3 18.8 8.0 11.9
Std deviation length 4.0 12.4 5.9 3.1
Min length 7 10 4 7
Max length 18 58 30 19
Expert length 6 8 3 6
Errors 0.9 3.1 1.1 1.1
Time to solve 4:25 9:58 3:23 6:14
Figure 33.3
Percent attempt completion between the source and hints groups. (From Barnes, T. J. et al., A pilot study on logic
proof tutoring using hints generated from historical student data, in Baker, R. S. J. d., Barnes, T., and Beck, J. E.
(Eds.) Educational Data Mining 2008: 1st International Conference on Educational Data Mining, Proceedings. Montreal,
Quebec, Canada. June 20–21, 2008, pp. 197–201. Available online at: http://www.educationaldatamining.org/
EDM2008/index.php?page=proceedings)
study, hints were delivered for 91% of hint requests. This suggests that hints are needed
precisely where we have data in our MDPs from previous semesters. We plan to investi-
gate the reasons for this surprising result with further data analysis and experiments. It is
possible that there are a few key places where many students need help. Another explana-
tion is that, when students are performing actions that have not been taken in the past,
they may have high confidence in these steps.
Based on these results (see Table 33.6), we conclude that the hint generator is particularly
helpful for harder problems such as 3.5, where students might give up if hints were not
available. The hint generator may also encourage lower performing students to persevere
in problem-solving.
Table 33.6
Hint Usage and Availability by Problem, Including All
Solution Attempts in Spring 2008
Problem 3.2 3.5 3.6 3.8
Attempts 69 57 44 46
% Moves w/ avail. hints 44.2% 45.8% 51.2% 48.7%
% Hints delivered 90.3% 91.4% 94.3% 92.2%
Average # hints 6.82 8.03 3.57 6.63
Min # hints 0 0 0 0
Max # hints 22 44 17 37
Std deviation # hints 2.44 3.68 2.03 1.88
Source: Barnes, T. J. et al., A pilot study on logic proof tutoring using
hints generated from historical student data, in Baker, R. S.
J. d., Barnes, T., and Beck, J. E. (Eds.) Educational Data Mining
2008: 1st International Conference on Educational Data Mining,
Proceedings. Montreal, Quebec, Canada. June 20–21, 2008,
pp. 197–201. Available online at: http://www.educationaldatamining.org/EDM2008/index.php?page=proceedings
33.6 Conclusions
In this chapter, we have demonstrated how to augment a CAI with a data-driven automatic
hint generator to create an MDP-Tutor for logic proofs. We have also discussed our feasibil-
ity studies to verify sufficient hint availability, and a pilot study showing that augmenting
the DT-tutor with hints helps more students complete difficult logic proof problems. Using
these methods, MDP-Tutors can be created from CAIs in procedural problem-solving
domains. Our approach has been used by other researchers in a linked lists tutor called
iList, where the MDPs generated based on student work were used to select hints like
those given by human tutors, including warning students when their current action is
unlikely to lead to a successful problem solution [12].
There are a number of new directions we are exploring with the method. As discussed
above, we plan to explore using different reward functions for different types of students
when generating hints, including (1) expert, (2) typical, and (3) least error prone. The
reward function we have used herein reflects an expert reward function, where the value
for a state reflects the shortest path to the goal state. Alternatively, when the “Hint” button
is pressed, we could select a personalized reward function for the current student based
on their student profile. If we have identified the student as an at-risk student, we may
select the “least error-prone” reward function for generating hints. On the other hand,
high-performing students would likely benefit from expert hints, while students between
these two extremes may benefit from hints reflecting typical student behavior. If there is
sufficient data, we can create separate MDPs for students in particular groups, such as
high, low, and medium performers on a previous exercise, learning styles, GPA, or other
factors, and use these to generate personalized hints. These hints would be contextualized
both within the problem and by student characteristics. We also envision using MDPs to
create higher level user models that could be applied to a variety of different problems,
and to generate statistics to be used in knowledge tracing.
References
1. Anderson, J. and Gluck, K. 2001. What role do cognitive architectures play in intelligent tutor-
ing systems? In D. Klahr and S. Carver (eds.), Cognition & Instruction: 25 Years of Progress,
pp. 227–262, Erlbaum, Mahwah, NJ.
2. Amershi, S. and Conati, C. 2009. Combining unsupervised and supervised machine learning
to build user models for exploratory learning environments. Journal of Educational Data Mining,
1(1): 18–71.
3. Baffes, P. and Mooney, R.J. 1996. A novel application of theory refinement to student modeling.
Proceedings of the AAAI-96, pp. 403–408, Portland, OR, August 1996.
4. Baker, R., Corbett, A.T., Roll, I., and Koedinger, K.R. 2008. Developing a generalizable detector
of when students game the system. User Modeling and User-Adapted Interaction, 18 (3): 287–314.
5. Barnes, T. and Stamper, J. 2007. Toward the extraction of production rules for solving logic
proofs. In C. Heiner, T. Barnes, and N. Heffernan (eds.), Proceedings of the 13th International
Conference on Artificial Intelligence in Education, Educational Data Mining Workshop (AIED2007),
pp. 11–20, Los Angeles, CA.
6. Barnes, T. and Stamper, J. 2009. Automatic hint generation for logic proof tutoring using histori-
cal data. Journal of Educational Technology & Society, Special Issue on Intelligent Tutoring Systems,
13(1), 3–12, February 2010.
7. Barnes, T. and Stamper, J. 2009. Utility in hint generation: Selection of hints from a corpus of
student work. In V. Dimitrova, R. Mizoguchi, B. du Boulay, and A.C. Graesser (eds.), Proceedings
of 14th International Conference on Artificial Intelligence in Education, AIED 2009, pp. 197–204, IOS
Press, Brighton, U.K., July 6–10, 2009.
8. Beck, J., Woolf, B.P., and Beal, C.R. 2000. ADVISOR: A machine learning architecture for intel-
ligent tutor construction. Seventh National Conference on Artificial intelligence, pp. 552–557, AAAI
Press/The MIT Press, Menlo Park, CA.
9. Chi, M., Jordan, P.W., VanLehn, K., and Litman, D.J. 2009. To elicit or to tell: Does it matter? In V.
Dimitrova, R. Mizoguchi, B. du Boulay, and A.C. Graesser (eds.), Proceedings of 14th International
Conference on Artificial Intelligence in Education, AIED 2009, pp. 197–204, IOS Press, Brighton,
U.K., July 6–10, 2009.
10. Croy, M. 2000. Problem solving, working backwards, and graphic proof representation. Teaching
Philosophy, 23(2): 169–187.
11. Croy, M., Barnes, T., and Stamper, J. 2007. Towards an intelligent tutoring system for propo-
sitional proof construction. In P. Brey, A. Briggle, and K. Waelbers (eds.), Proceedings of 2007
European Computing and Philosophy Conference, Enschede, the Netherlands, IOS Publishers,
Amsterdam, the Netherlands.
12. Fossati, D., Di Eugenio, B., Ohlsson, S., Brown, C., Chen, L., and Cosejo, D. 2009. I learn from
you, you learn from me: How to make iList learn from students. In V. Dimitrova, R. Mizoguchi,
B. Du Boulay, and A. Graesser (eds.), Proceedings of 14th International Conference on Artificial
Intelligence in Education, AIED 2009, IOS Press, Brighton, U.K., July 6–10, 2009.
13. Hui, B., Gustafson, S., Irani, P., and Boutilier, C. 2008. The need for an interaction cost model
in adaptive interfaces. In Proceedings of the Working Conference on Advanced Visual Interfaces,
AVI ‘08, pp. 458–461, Napoli, Italy, May 28–30, 2008, ACM, New York.
14. Hunt, E. and Madhyastha, T. 2005. Data mining patterns of thought. In Proceedings of the AAAI
22nd National Conference on Artificial Intelligence Educational Data Mining Workshop (AAAI2005),
Pittsburgh, PA.
15. Jameson, A., Grossman-Hutter, B., March, L., Rummer, R., Bohnenberger, T., and Wittig, F. 2001.
When actions have consequences: Empirically based decision making for intelligent user inter-
faces. Knowledge-Based Systems, 14: 75–92.
16. Koedinger, K.R., Aleven, V., Heffernan, T., McLaren, B., and Hockenberry, M. 2004. Opening
the door to non-programmers: Authoring intelligent tutor behavior by demonstration. In
Proceedings of the 7th Intelligent Tutoring Systems Conference, pp. 162–173, Maceio, Brazil.
17. Matsuda, N., Cohen, W.W., Sewall, J., Lacerda, G., and Koedinger, K.R. 2007. Predicting stu-
dents performance with SimStudent that learns cognitive skills from observation. In R. Luckin,
K. R. Koedinger, and J. Greer (eds.), Proceedings of the International Conference on Artificial
Intelligence in Education, pp. 467–476, Los Angeles, CA, IOS Press, Amsterdam, the Netherlands.
18. Mayo, M. and Mitrovic, A. 2001. Optimising ITS behaviour with Bayesian networks and deci-
sion theory. International Journal of Artificial Intelligence in Education, 12: 124–153.
19. McLaren, B., Koedinger, K., Schneider, M., Harrer, A., and Bollen, L. 2004. Bootstrapping Novice
data: Semi-automated tutor authoring using student log files. In Proceedings of the Workshop on
Analyzing Student-Tutor Interaction Logs to Improve Educational Outcomes, Proceedings of the 7th
International Conference on Intelligent Tutoring Systems (ITS-2004), pp. 199–207, Maceió, Brazil.
20. Merceron, A. and Yacef, K. 2005. Educational data mining: A case study. In 12th International
Conference on Artificial Intelligence in Education, IOS Press, Amsterdam, the Netherlands.
21. Mitrovic, A., Koedinger, K., and Martin, B. 2003. A comparative analysis of cognitive tutor-
ing and constraint-based modeling. User Modeling, Lecture Notes in Computer Science, Vol. 2702,
pp. 313–322. Springer, Berlin, Germany.
22. Murray, T. 1999. Authoring intelligent tutoring systems: An analysis of the state of the art.
International Journal of Artificial Intelligence in Education, 10: 98–129.
23. Murray, R. C., VanLehn, K., and Mostow, J. 2004. Looking ahead to select tutorial actions: A
decision-theoretic approach. International Journal of Artificial Intelligence in Education, 14, 3,4
(December 2004), 235–278.
480 Handbook of Educational Data Mining
24. Romero, C., Ventura, S., Espejo, P.G., and Hervas, C. 2008. Data mining algorithms to classify
students. In Proceedings of the First International Conference on Educational Data Mining, pp. 8–17,
Montreal, Canada, June 20–21, 2008.
25. Sieg, W. 2007. The AProS project: Strategic thinking & computational logic. Logic Journal of
IGPL, 15(4): 359–368.
26. Soller, A. 2004. Computational modeling and analysis of knowledge sharing in collaborative
distance learning. User Modeling and User-Adapted Interaction, 14(4): 351–381.
27. Sutton, S. and Barto, A. 1998. Reinforcement Learning: An Introduction, MIT Press, Cambridge,
MA.
28. VanLehn, K. 2006. The behavior of tutoring systems. International Journal of Artificial Intelligence
in Education, 16(3): 227–265.
29. Vygotsky, L. 1986. Thought and language, MIT Press, Cambridge, MA.
34
Data Mining Learning Objects
Contents
34.1 Introduction......................................................................................................................... 481
34.2 Introduction: Formulation, Learning Objects................................................................ 482
34.3 Data Sources in Learning Objects.................................................................................... 482
34.3.1 Metadata...................................................................................................................483
34.3.2 External Assessments............................................................................................. 483
34.3.3 Information Obtained When Managing Learning Objects..............................484
34.4 The Learning Object Management System AGORA.....................................................484
34.5 Methodology....................................................................................................................... 485
34.5.1 Collect Data.............................................................................................................. 486
34.5.2 Preprocess the Data................................................................................................ 486
34.5.2.1 Select Data................................................................................................. 486
34.5.2.2 Create Summarization Tables................................................................ 486
34.5.2.3 Data Discretization.................................................................................. 487
34.5.2.4 Data Transformation................................................................................ 487
34.5.3 Apply Data Mining and Interpret Results.......................................................... 487
34.5.3.1 Clustering Algorithms............................................................................ 487
34.5.3.2 Classification Algorithms....................................................................... 487
34.5.3.3 Association Algorithms.......................................................................... 489
34.6 Conclusions.......................................................................................................................... 490
Acknowledgments....................................................................................................................... 490
References...................................................................................................................................... 490
34.1 Introduction
Data mining techniques are applied in education from several perspectives. Many of them are oriented toward facilitating students' work or adapting existing systems to their needs. We are interested in improving the activities of teachers and instructional designers. Learning objects (LOs) are of great importance at present, since they are the building blocks of different types of computer-based learning systems. In this chapter, a new approach to data mining in e-learning is presented, introducing a method and a tool to extract and process relevant data from LOs. The chapter addresses a major question: which data must be processed in order to discover important rules about the usability of LOs in different environments?
• Any entity, digital or non-digital, which can be used, reused, or referenced during
technology-supported learning [3].
• A structured and independent media resource that encapsulates high-quality
information to facilitate learning and pedagogy [4].
• Any digital resource that can be used as support for learning [5].
The definitions are diverse but, in general, all of them consider LOs to be recyclable media content items that have an educational purpose. We consider an LO to be any digital entity developed with instructional design intentions, which can be used, reused, or referenced for learning.
Many LOs are produced, but their availability sometimes cannot be guaranteed due to a lack of metadata, or because they are not properly organized or categorized. Their reusability is limited because the information about their content is not standardized and very often is either absent or incomplete.
Metadata generation, evaluation, and search are frequent activities that can provide relevant data about LOs in order to determine their use in different environments. It is important, therefore, to provide methods and tools suited to analyzing LOs from the point of view of their educational dimensions.
The use of data mining techniques has produced good results, especially with models that can assess students' knowledge and skills from observation [6] or automatically generate hints for students using their historical data [7]. In addition, studies [8,9] have presented mechanisms for adjusting such systems to the learner's individual needs.
This chapter presents advances in knowledge discovery activity, applying knowledge discovery in databases (KDD) and, in particular, data mining (DM) techniques to the information that can be drawn from LOs. The following results are presented:
Efforts in this direction have generally considered the objects only partially, especially by not covering their multiple computational and pedagogical aspects, as well as their design and use. In recent years we have focused, among other things, on determining a set of attributes that can represent the main characteristics of LOs as fully as possible. We propose a characterization of LOs based on three data categories to be processed:
• Metadata
• External assessments
• Information obtained when managing LOs
34.3.1 Metadata
Metadata contain primary and objective information about LOs; they characterize an object through introspective analysis. Metadata standards facilitate reusability, search, and adaptation to different environments. However, in many cases, metadata are missing, incomplete, or poorly constructed. The main standards and specifications cover various aspects, such as the following:
• Packaging (e.g., IMS CP [10] and SCORM Content Package 2004 [11])
• Labeling (e.g., IMS MD [12], IEEE LOM [3])
• Sequencing (e.g., IMS Simple Sequencing [13])
SCORM 2004 is one of the most widely used packaging specifications. It is based on IMS metadata, an application profile of the IEEE Learning Object Metadata (LOM) specification. LOM is a specification, encoded in XML, that describes a set of metatags used to represent metadata. These metadata are oriented to teaching and learning, but they are insufficient to suit the needs of various educational systems.
In LOM, tags can be filled with two types of values: "controlled vocabulary" values and "free text" values. Labels are formalized in an XML multi-schema that implements the specification. In addition, LOM provides a mechanism for adjusting the specification, called an "application profile," which must meet the following restrictions:
• It must preserve the data types and the element value spaces of the basic scheme.
• It cannot define new data types or value spaces for the added elements.
LOM-ES [14] is a Spanish application profile that contains several extensions, especially new labels and vocabularies. It is used to classify LOs according to a set of rules, including taxonomies and thesauri, that permit specifying, among other things, discipline, idea, prerequisite, educational objective, accessibility restrictions, instructional level, and skill level. LOM-ES supports the development of several important services, such as LO search and retrieval in federated repositories, or content previewing. We used LOM-ES for data analysis, including 15 attributes such as aggregation level, interactivity level, semantic density, structure, and language. Nevertheless, the general method proposed is also applicable when other schemes or specifications are used.
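As an illustration of how such metadata attributes can be pulled into the data mining pipeline, the following minimal Python sketch reads a simplified LOM-like XML fragment and flattens a handful of attributes into a record. The element names, structure, and field list are hypothetical simplifications; a real LOM or LOM-ES instance uses the IEEE LTSC namespace and a deeper element hierarchy, so the paths would need to be adapted.

import xml.etree.ElementTree as ET

# Hypothetical, simplified LOM-like fragment (real LOM/LOM-ES documents are namespaced
# and more deeply nested, so these paths are illustrative only).
SAMPLE = """
<lom>
  <general><structure>atomic</structure><aggregationLevel>1</aggregationLevel></general>
  <educational>
    <interactivityLevel>medium</interactivityLevel>
    <semanticDensity>high</semanticDensity>
    <language>es</language>
  </educational>
</lom>
"""

# Attributes to collect for the data mining stage (illustrative subset).
FIELDS = {
    "structure": "general/structure",
    "aggregation_level": "general/aggregationLevel",
    "interactivity_level": "educational/interactivityLevel",
    "semantic_density": "educational/semanticDensity",
    "language": "educational/language",
}

def extract_metadata(xml_text):
    """Return a flat dictionary of selected metadata values (None when missing)."""
    root = ET.fromstring(xml_text)
    record = {}
    for name, path in FIELDS.items():
        node = root.find(path)
        record[name] = node.text.strip() if node is not None and node.text else None
    return record

print(extract_metadata(SAMPLE))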
techniques, mostly based on expert assessments. Data obtained from LOs by this route are mostly subjective and are based on the use of questionnaires to measure impact on learning results.
There are no internationally recognized standards or schemes to assess the quality or usability of LOs. Some systems are oriented to assessing particular technological or pedagogical attributes.
Our proposal seeks to contribute in this regard. We developed a methodology, and also an instrument, for assessing the most important aspects of LOs from the pedagogical point of view, as well as the technical features not included in metadata.
MECOA [15] (a Spanish acronym for a quality evaluation model for learning objects) contains 22 attributes, including those relating to the interface, the current level of the content, the match between the resources and the learning goal, the type of educational activity, and the degree of difficulty in dealing with the content. This method is also applicable when other schemes for quality assessment are used.
The AGORA system [16], which is briefly presented in the following section, is an attempt to provide a unified platform for LO management, intended to facilitate the design and use of such resources by teachers and instructional designers.
Figure 34.1
AGORA architecture.
34.5 Methodology
We take as a reference a methodology that contains the same stages as the general data mining process [17]; in particular, for e-learning applications we build on a data mining methodology for course management systems (CMS) that focuses on the study of learners' interactions [8].
We propose a data mining process for LOs and their characteristics. This method contains four stages: collect the data; preprocess the data; apply data mining; and interpret, evaluate, and deploy the results. We considered the AGORA platform as a case study (Figure 34.2).
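As a hedged illustration of the preprocessing stage (selecting data, building summarization tables, and discretizing values), the short pandas sketch below groups hypothetical AGORA usage records into one row per LO and bins a numeric quality score into labels. The column names, data, and thresholds are invented for the example and are not the actual AGORA schema.

import pandas as pd

# Hypothetical raw records joining metadata, quality scores, and usage events.
raw = pd.DataFrame({
    "lo_id":         [1, 1, 2, 2, 3],
    "downloads":     [3, 5, 1, 2, 8],
    "quality":       [4.2, 4.2, 3.1, 3.1, 4.8],
    "interactivity": [2, 2, 4, 4, 5],
})

# Summarization table: one row per learning object.
summary = raw.groupby("lo_id").agg(
    total_downloads=("downloads", "sum"),
    quality=("quality", "first"),
    interactivity=("interactivity", "first"),
).reset_index()

# Discretization: map the numeric quality score onto labels usable by rule learners.
summary["quality_level"] = pd.cut(
    summary["quality"],
    bins=[0, 3.5, 4.5, 5.0],
    labels=["low", "medium", "high"],
)
print(summary)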
Figure 34.2
Scheme of the learning objects KDD process.
Figure 34.3
Clustering results: three clusters distinguished by how well the interface describes competencies (well / very good / scantly), by the mode of knowledge transfer (interpretations / demonstrations / problem solving), and by the designer's profile (levels of computing, didactics, and design skills).
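The chapter's reference list includes classic clustering algorithms such as k-means [19] and EM [20]. As a rough sketch of how this clustering step could look, the fragment below one-hot encodes a few invented categorical LO and designer attributes and clusters them with scikit-learn's KMeans; the data, attribute names, and number of clusters are illustrative only.

import pandas as pd
from sklearn.cluster import KMeans

# Toy categorical descriptions of LOs and their designers (invented values).
los = pd.DataFrame({
    "competencies_described": ["well", "very_good", "scantly", "well", "scantly"],
    "knowledge_transfer":     ["interpretations", "demonstrations", "problem_solving",
                               "interpretations", "problem_solving"],
    "designer_computing":     ["advanced", "average", "average", "advanced", "average"],
})

# One-hot encode the categorical attributes so a distance-based algorithm can use them.
X = pd.get_dummies(los)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
los["cluster"] = kmeans.labels_

# Inspect the dominant attribute values in each cluster, as in Figure 34.3.
for c, group in los.groupby("cluster"):
    print(c, group.drop(columns="cluster").mode().iloc[0].to_dict())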
Various tests were carried out with the ID3 and J48 algorithms on the datasets already mentioned (a minimal sketch of this classification step, using a decision tree learner, follows the list below). In accordance with the attributes' main pedagogical impact from the experts' point of view, the following classification attributes were defined for the LOs:
• Semantic density, whose labels are very low, low, medium, high, very high
• Knowledge transfer, whose labels are examples, demos, applications, interpreta-
tions, none
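The chapter's experiments use Weka's ID3 and J48 (a C4.5 implementation); in the sketch below, an entropy-based decision tree from scikit-learn stands in as an analogue, trained on a few invented LO records and printed as IF-THEN style rules. Attribute names, values, and class labels are illustrative only.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented LO records; the target mirrors the "semantic density" class used above.
data = pd.DataFrame({
    "interactivity_level": ["low", "high", "medium", "high", "low", "medium"],
    "structure":           ["atomic", "networked", "atomic", "atomic", "linear", "networked"],
    "semantic_density":    ["low", "high", "medium", "high", "low", "medium"],
})

X = pd.get_dummies(data[["interactivity_level", "structure"]])
y = data["semantic_density"]

# Entropy-based splitting is the closest scikit-learn analogue to ID3/C4.5.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Print the learned tree as readable IF-THEN rules for manual inspection.
print(export_text(tree, feature_names=list(X.columns)))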
We obtained a set of IF-THEN-ELSE rules from the algorithms. After an analysis, we eliminated those rules that contained irrelevant information. Tables 34.1 and 34.2 show some of the best rules obtained.
Table 34.1
Some of the Best Rules Obtained with the ID3 Algorithm
Rules—Generated Rules—Interpretation
Table 34.2
Some of the Best Rules Obtained with the J48 Algorithm
Rules—Generated Rules—Interpretation
Table 34.3
Some of the Best Rules Obtained with the Apriori Algorithm
Reliability | Rules—Generated | Rules—Interpretation
0.98 | version = final; environment = classroom => aggregation level = 1 | If the LO is a final version and is used in classroom (face-to-face) learning, then its aggregation level is basic.
0.99 | structure = atomic; version = final; environment = classroom => aggregation level = 1 | If the LO has an atomic structure, is a final version, and is used in classroom learning, then its aggregation level is basic.
0.97 | aggregation level = 1; version = final; easy navigation = security => environment = classroom | If the LO is basic, is a final version, and offers secure navigation, then it is used in classroom learning.
0.98 | version = final; level = vocational training => aggregation level = 1 | If the LO is a final version and is oriented to vocational training, then its aggregation level is basic.
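Table 34.3 lists rules obtained with the Apriori algorithm. As a hedged sketch of how such rules can be mined, the fragment below uses the mlxtend implementation on a small, invented one-hot table of attribute=value items and keeps only high-confidence rules, analogous to the reliability column above.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot table: each row is an LO, each column an attribute=value item (invented data).
items = pd.DataFrame({
    "version=final":         [True, True, True, False, True],
    "environment=classroom": [True, True, False, False, True],
    "structure=atomic":      [True, False, True, True, True],
    "aggregation_level=1":   [True, True, True, False, True],
})

frequent = apriori(items, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)

# Keep the columns closest to the presentation of Table 34.3.
print(rules[["antecedents", "consequents", "support", "confidence"]])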
34.6 Conclusions
In this chapter, we presented an adapted methodology for applying data mining techniques to LOs, aiming to discover relevant characteristics of their design and usage.
For this method, the learning object management system AGORA was used to evaluate LO data from three main perspectives: metadata, quality assessments, and several management activities. Some tests were carried out using clustering and classification algorithms, and particular details about processing LO data were presented.
The method and results help improve the work of teachers in designing, searching for, and managing the LOs required for their activities. This proposal seeks to obtain knowledge about the possibilities of using data mining technologies in education, improving the work of teachers and instructional designers.
The results show that data mining methods and techniques are useful for discovering knowledge from the information available in LOs. For example, it is possible to generate rules from information related to user profiles. The rule set can be used to improve certain processes, such as searching, sequencing, and editing of LOs.
The clustering tests provided us with relevant information about the attributes that define each group. The classification and association tests supplied significant information about the key attributes that contribute to the LO rules.
The obtained rules allow the development of classifiers to improve the search mechanisms in AGORA. For example, information relating to user profiles and their needs may be used as filters for recommending resources, or for recommending users with similar needs.
This is a first approach toward redefining attributes and considering other information sources to supplement LOM, and it allows us to establish which elements are crucial for classifying, suggesting, or recommending action values in a learning management system.
Acknowledgments
This work is partially supported by the AECID A/016625/08 Project (Spain), the YUC 2006-C05-65811 FOMIX CONACYT project (México), the Consejo Nacional de Ciencia y Tecnología CONACYT (México), the Consejo de Ciencia y Tecnología del Estado de Yucatán CONCyTEY (México), and the Programa de Mejoramiento del Profesorado PROMEP (México).
References
1. Knolmayer, G.F., Decision support models for composing and navigating through e-learning
objects. In The 36th Annual Hawaii International Conference on System Sciences (HICSS’03), Big Island,
Track 1, HI, 2003.
2. Mohan, P., Reusable online learning resources: Problems, solutions and opportunities. In The
Fourth IEEE International Conference on Advanced Learning Technologies (ICALT’04), Joensuu,
Finland, pp. 904–905, 2004.
3. Institute of Electrical and Electronics Engineers, L.T.S.C. Draft standard for learning object
metadata, IEEE, 1484.12.1, 2002. http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_
Draft.pdf (accessed December 15, 2008).
4. Nugent, G., Soh, L.-K., Samal, A., Person, S., and Lang, J., Design, development, and validation
of a learning object for CS1. In The 10th Annual SIGCSE Conference on Innovation and Technology
in Computer Science Education, Lisbon, Portugal, ACM, Caparica, Portugal, 2005.
5. Wiley, D., Connecting learning objects to instructional design theory: A definition, a metaphor,
and a taxonomy. The Instructional Use of Learning Objects, Wiley, D.A. (ed.), Association for
Educational Communications and Technology, Bloomington, IN, 2002.
6. Desmarais, M., Villareal, A., and Gagnon, M., Adaptive test design with a naïve Bayes framework. In First International Conference on EDM, Montreal, Canada, 2008. http://www.educationaldatamining.org/EDM2008/uploads/proc/5_Desmarais_17.pdf (accessed December 18, 2008).
7. Barnes, T., Stamper, J., Lehman, L., and Croy, M., A pilot study on logic proof tutoring using hints generated from historical student data. In First International Conference on EDM, Montreal, Canada, 2008. http://www.educationaldatamining.org/EDM2008/uploads/proc/22_Barnes_41a.pdf (accessed December 18, 2008).
8. Romero, C., Ventura, S., and García, E., Data mining in course management systems: Moodle case
study and tutorial. Department of Computer Sciences and Numerical Analysis, University of
Córdoba, 2007. http://sci2s.ugr.es/keel/pdf/specific/articulo/CAE-VersionFinal.pdf (accessed
October 10, 2008).
9. Ventura, S., Romero, C., and Hervás, C., Analyzing rule evaluation measures with educational datasets: A framework to help the teacher. In First International Conference on EDM, Montreal, Canada, 2008. http://www.educationaldatamining.org/EDM2008/uploads/proc/18_Ventura_4.pdf (accessed December 2, 2008).
10. IMS Global Learning Consortium, IMS content packaging, 2004. http://www.imsglobal.org/
content/packaging/cpv1p1p4/imscp_bestv1p1p4.html (accessed December 20, 2008).
11. Advanced Distributed Learning. Sharable Content Object Reference Model (SCORM),
Overview, 2004. http://www.adlnet.org/index.cfm?fuseaction=scormabt (accessed October 15, 2008).
12. IMS Global Learning Consortium. Learning design specification, 2003. http://www.imsglobal.
org/learningdesign/ldv1p0/imsld_infov1p0.html (accessed December 20, 2008).
13. IMS Global Learning Consortium. IMS Simple sequencing, 2003. http://www.imsglobal.org/
simplesequencing/index.html (accessed December 20, 2008).
14. Blanco, J.J., Galisteo del Valle, A., and García, A., Perfil de aplicación LOM-ES V.1.0. Asociación
Española de Normalización y Certificación (AENOR), 2006. http://www.educaplus.org/docu-
mentos/lom-es_v1.pdf (accessed December 16, 2008).
15. Prieto, M.E. et al., Metodología y herramientas para la evaluación de la calidad de los recur-
sos para tele-aprendizaje en la formación de profesores, Internal Report. Agencia Española de
Cooperación Internacional para el Desarrollo, Proyecto AECI A/8172/07, 2008.
16. Prieto, M., Menéndez, V., Segura, A., and Vidal, C. 2008. Recommender system architecture
for instructional engineering. In First World Summit on the Knowledge Society, Athens, Greece,
pp. 314–321, September 24–28, 2008.
17. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., The KDD Process for Extracting Useful Knowledge from Volumes of Data, ACM, New York, 1996. http://www.citeulike.org/user/imrchen/article/1886790 (accessed November 18, 2008).
18. Witten, I. H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, Morgan
Kaufman, San Francisco, CA, 2005.
19. MacQueen, J., Some methods for classification and analysis of multivariate observations. In
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA,
pp. 281–297, 1967.
20. Dempster, A., Laird, N., and Rubin, D., Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society, 39(1), 1–38, 1977.
21. Quinlan, J. R., Induction of decision trees. Machine Learning, 1(1), 81–106, 1986.
22. Ye, P. and Baldwin, T., Semantic role labeling of prepositional phrases. In The Second International
Joint Conference on Natural Language Processing, Jeju Island, Korea, pp. 779–791, 2005.
23. Agrawal, R., Imielinski, T., and Swami, A., Mining association rules between sets of items in
large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management
of Data, Washington, DC, pp. 207–216, 1993.
35
An Adaptive Bayesian Student
Model for Discovering the Student’s
Learning Style and Preferences
Contents
35.1 Introduction......................................................................................................................... 493
35.2 The Learning Style Model................................................................................................. 494
35.3 The Decision Model............................................................................................................ 497
35.3.1 Building the Initial Model..................................................................................... 498
35.3.2 Adapting the Model............................................................................................... 499
35.4 Selecting the Suitable Learning Objects..........................................................................500
35.5 Conclusions and Future Work..........................................................................................500
References...................................................................................................................................... 502
35.1 Introduction
In recent years, a considerable number of online learning object repositories* have been implemented (e.g., MERLOT [1] and ARIADNE [2]). Using learning object repositories is a way to increase the flexibility and manageability of rich libraries of learning resources available online from academic institutions, publishers, and organizations. In such repositories, learning objects are shared across different learning environments and can be accessed on demand either by learners or by instructors.
One of the key issues concerning the use of learning object repositories is the retrieval
and searching facilities of learning objects. One approach is to “filter” and “sort” the
learning objects according to the student’s learning style and preferences, so he or she
can make a better use of them. To this end, students’ learning styles can be acquired
using one of the existing psychometric instruments. Then, some decision rules are
established. Such rules represent the matches between learning styles and educational
objects. Following this idea, some educational hypermedia systems have implemented
different learning style models for a better adaptation of their educational resources to
their users (e.g., MANIC [4], AES-CS [5], INSPIRE [6], iWeaver [7], TANGOW/WOTAN [8],
WHURLE [9], and CS383 [10]). However, as argued in [11], “There are no proven recipes for
the application of learning styles in adaptation.” First, in the majority of these approaches,
* The learning object metadata [3] is a standard to specify the syntax and semantics of learning objects using a
set of attributes that adequately describe them.
assumptions about the student’s learning style are static, that is, once acquired, they are
no longer updated with evidence gathered from the student’s interactions with the sys-
tem. The rules included in the decision models do not change either. Thus, the model
is used for adaptation, but it is unable to adapt itself to new evidence. Second, during the interaction with the system, the student could change his or her preferences toward another kind of learning object that no longer matches his or her inferred learning style, a problem known as concept drift [12]. In these scenarios, adaptive decision models, capable of better fitting the current student's preferences, are desirable.
The main contribution of our approach is that the model presented is adaptive, i.e., the
initial acquired information about the student’s learning style and preferences is updated
according to the results of the student’s interactions with the system. We use all the back-
ground knowledge available to build an initial learning style model and a decision model for
each particular student. The learning style model, represented as a Bayesian network (BN)
[13] and based on the Felder–Silverman learning style model [14], classifies students in four
dimensions: processing, perception, input, and understanding.
To initialize the learning style model, we use the Felder and Soloman Index of Learning Style Questionnaire [15]. Completing the questionnaire is not mandatory for the student,
so if he or she chooses not to answer it, we use the uniform distribution. Then, the student’s
selections are set as evidences in the model, triggering the evidence propagation mecha-
nism and getting up-to-date beliefs for the learning styles. This learning style model was
first introduced in [16]. For the decision model, we use a BN classifier [17] that represents
the matches between learning styles and learning objects in order to decide if a resource
is interesting to a student or not. We use a subset of metadata attributes to represent a
learning object [3], in particular, those related with the learning style dimensions: Format,
Learning Resource Type, Interactivity Level, Interactivity Type, and Semantic Density.
We learn an initial classifier from data randomly generated by some predefined rules.
Then, when the student selects a resource (and eventually gives feedback), we will incor-
porate this information to the model in order to reflect more accurately the current prefer-
ences. Moreover, our decision model is capable of adapting quickly to any change of the
student’s preferences. This proposal is an improvement of the approach proposed in [18]
and was first presented in [19], where the learning style once acquired was not updated
and the decision model was modeled using an adaptive naive Bayes classifier.
In Section 35.2, we explain the design of the learning style model. Section 35.3 is devoted
to the description of the decision model. Next, in Section 35.4, we briefly describe the whole
process aimed at selecting the proper learning objects for a student each time he or she
makes a topic selection. Finally, we conclude with a summary and a description of ongoing
and future work.
style, Felder and Soloman proposed the Index of Learning Style Questionnaire [15]. The questionnaire classifies the student's preference in each category as mild, moderate, or strong. The results of this test, if the student has chosen to take it, are used to initialize the learning style model. In general, the use of tests to initialize student models has some drawbacks. First, students tend to choose answers arbitrarily. Second, it is really difficult to design tests capable of exactly measuring "how people learn." Therefore, the information gathered through these instruments carries some degree of uncertainty. Moreover, this information, as a rule, is no longer updated with evidence gathered from the student's interactions with the system.
Our approach uses a BN to model the student's learning style [21,22], instead of acquiring it only through a psychometric test. A BN [23] is composed of two components: the qualitative part (its structure) and the quantitative part (the set of parameters that quantifies the network). The structure is a directed acyclic graph whose nodes represent random variables and whose arcs represent dependencies (causal influence relationships) between the variables. The parameters are conditional probabilities that represent the strength of the relationships. A Bayesian learning style model allows observations of the user's behavior to be used to discover learning styles automatically through the network's inference mechanisms. In these works, the BN structure is designed by experts and the parameters are specified from data obtained from both the expert and the log files.
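Formally, the structure and parameters together define a factorization of the joint distribution over all the variables; in the usual notation (a standard result, not specific to this chapter),

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P\left(X_i \mid \mathrm{Pa}(X_i)\right),

where Pa(X_i) denotes the set of parents of X_i in the graph. Evidence propagation computes posterior beliefs for any variable given observed values of the others under this factorization.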
We propose to design the learning style model using a hybrid approach. The four dimensions of the learning style model are initialized according to the initial test results. We then observe the student's selections of different learning objects and set them as evidence in the BN. Therefore, whenever new evidence about the preferences of the student arrives (student's selections and feedback), the propagation mechanism is automatically triggered and produces up-to-date beliefs for the learning style. Thus, we can refine the initial values for the student's learning style acquired by the initial test as the student interacts with the system, becoming more and more confident over time. In our BN model, we consider three types of variables:
1. Variables to represent the student's learning style, one variable for each dimension of the model: Input = {visual, verbal}; Processing = {active, reflective}; Perception = {sensing, intuitive}; Understanding = {sequential, global}.
2. Variables to represent the selected learning object, one variable for each metadata attribute that we consider significant for modeling learning style (see Table 35.1):
Table 35.1
Learning Style vs. Learning Object Type Attribute
Visual Verbal Sensing Intuitive Sequential Global Active Reflective
Exercise ◽ ◾ ◾ ◽ ◾ ◽ ◾ ◽
Simulation ◾ ◽ ◾ ◽ ◾ ◽ ◾ ◽
Questionnaire ◽ ◾ ◽ ◾ ◾ ◽ ◾ ◽
Figure ◾ ◽ ◾ ◽ ◽ ◾ ◽ ◾
Index ◽ ◾ ◾ ◽ ◽ ◾ ◽ ◾
Table ◾ ◽ ◾ ◽ ◽ ◾ ◽ ◾
Narrative-Text ◽ ◾ ◽ ◾ ◾ ◽ ◽ ◾
Exam ◽ ◾ ◾ ◽ ◾ ◽ ◾ ◽
Lecture ◽ ◾ ◽ ◾ ◾ ◽ ◽ ◾
Regarding the relationships between the variables, we consider that the student's learning style determines the student's learning object selections, and that the selected learning object together with the student's learning style determines the rating value for that learning object. Before selecting a learning object, the student only knows its format and the activity it implements. We consider that the student's selection shows his or her preference for a particular kind of learning object, and that this preference is influenced by the student's learning style. Only after selecting and viewing the learning object can the student rate it. After having explored several possibilities for modeling the learning style dimensions, we chose to model each dimension separately, thus obtaining four BN models, as depicted in Figure 35.1.
To define the a priori distribution for the nodes representing the learning style dimensions, we use the score obtained by the student in the initial test if the student took the test, or a uniform distribution otherwise. To estimate the conditional probabilities between learning style dimensions and each learning object attribute, we use some "matching tables" previously defined by an expert. These tables allow us to match learning styles with learning object attributes. An example of such a matching table, for the SelectedLearningObjectType variable, is shown in Table 35.1.
Figure 35.1
Modeling the four dimensions of the learning style (four separate BNs, one each for Input, Understanding, Perception, and Processing, with nodes for the selected learning object's format, type, semantic density, interactivity level or type, and the selected rating).
Figure 35.2
Updated beliefs for the Input dimension after the student's selections (initial BN: visual 27.00, verbal 73.00; updated BN: visual 26.04, verbal 73.96).
For instance, suppose a student takes the initial test and obtains, for the Input dimension, visual = 3 and verbal = 8. This student is classified as verbal-moderate. Then, every time the student makes a selection, the sequential update algorithm [13] is triggered in order to incorporate the new information into the BN model. This makes it possible to refine the initial beliefs for the student's learning style accordingly. If the student changes his or her preferences, that is, he or she begins to select objects that do not match our current estimation of his or her learning style, the network is able to interpret and account for this information and update the model. Figure 35.2 shows how the beliefs of the BN are updated after several of the student's selections.
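The following is a minimal sketch of this kind of belief updating, written with the pgmpy library (the model class is named BayesianNetwork in the versions we assume; newer releases rename it DiscreteBayesianNetwork). Only one dimension (Input) and one child node are modeled; the prior mirrors the beliefs shown in Figure 35.2, while the conditional probabilities and the observed selection are invented for the example.

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# One learning-style dimension (Input) with a single child node (the selected format).
model = BayesianNetwork([("Input", "SelectedFormat")])

prior = TabularCPD("Input", 2, [[0.27], [0.73]],
                   state_names={"Input": ["visual", "verbal"]})
likelihood = TabularCPD("SelectedFormat", 2,
                        [[0.7, 0.2],   # P(image | visual), P(image | verbal) -- invented
                         [0.3, 0.8]],  # P(text  | visual), P(text  | verbal) -- invented
                        evidence=["Input"], evidence_card=[2],
                        state_names={"SelectedFormat": ["image", "text"],
                                     "Input": ["visual", "verbal"]})
model.add_cpds(prior, likelihood)
model.check_model()

# The student selects a text resource: propagate the evidence and read the updated belief.
posterior = VariableElimination(model).query(["Input"], evidence={"SelectedFormat": "text"})
print(posterior)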
* A recommender system tries to present to the user the information items he or she is interested in. To do
this, the user’s profile is compared to some reference characteristics. These characteristics may be from the
information item (the content-based approach) or the user’s social environment (the collaborative-filtering
approach).
recommend) and the user’s learning style (the user’s features) are presented to the clas-
sifier as input, having as output a probability that represents the appropriateness of the
resource for this student (or how interesting the item is for this user). There are two issues
that are crucial in the definition of the decision model. First, the cold-start problem, that is,
the problem of obtaining the data to build the initial model. Second, the adaptation proce-
dure for updating the model with new data.
Table 35.2
Establishing the Attributes and Their Possible Values
Attribute | Values
Input | visualStrong (1); visualModerate (2); balanced (3); verbalModerate (4); verbalStrong (5)
Perception | sensingStrong (1); sensingModerate (2); balanced (3); intuitiveModerate (4); intuitiveStrong (5)
Understanding | sequentialStrong (1); sequentialModerate (2); balanced (3); globalModerate (4); globalStrong (5)
Processing | activeStrong (1); activeModerate (2); balanced (3); reflectiveModerate (4); reflectiveStrong (5)
Learning Object Type | exercise (1); simulation (2); questionnaire (3); figure (4); index (5); table (6); narrative text (7); exam (8); lecture (9)
Format | text (1); image (2); audio (3); video (4); application (5)
Interactivity Level | very-low (1); low (2); medium (3); high (4); very-high (5)
Interactivity Type | active (1); expositive (2); mixed (3)
Semantic Density | very-low (1); low (2); medium (3); high (4); very-high (5)
* A naive Bayes is a BN with a simple structure that has the class node as the parent node of all other feature
nodes.
Figure 35.3
Initial decision model.
rules extracted from the matching tables. We generated several datasets using an increasing number of instances (3125, 6250, 9375, 12500, 15625, 18750, and 21875), with 10 samples for each setting. In each dataset, all the possible learning styles are represented. Since there are 4 attributes for the learning style and each one has 5 values, we obtain 625 different learning styles. We generated datasets with 5, 10, 15, 20, 25, 30, and 35 examples for each learning style. The learning object characteristics were generated randomly, and the resulting examples were classified according to the matching rules. We then learned different models from the generated data: the naive Bayes and several k-DBCs, varying k from 1 to 5. To learn the k-DBCs, we apply, in conjunction with a score, a hill-climbing procedure as explained in [25]. In the experiments, we use different scores (BAYES, MDL, and AIC) with 10-fold cross validation. From the error rates obtained with each model and each score, we found that the best model was a 2-DBC. For k > 2, the accuracy does not improve significantly, which may indicate that there is a degree-2 dependence in this domain. Regarding the score function, the AIC score produces the lowest error rate, but the model generated is very complex (almost every node has two parents besides the class), so we chose the model generated using the BAYES score because of its simplicity and good performance. The error rate for the 2-DBC model with the BAYES score is 6.3%. The structure of the chosen model is shown in Figure 35.3. In addition to the relationships between the class and the attributes, we found other dependences between the attributes.
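The chapter's experiments learn naive Bayes and k-DBC models with scoring functions and hill climbing; as a rough, simplified analogue of the data-generation and evaluation step, the sketch below labels synthetic examples with a single invented matching rule and estimates the accuracy of a naive Bayes classifier with 10-fold cross-validation in scikit-learn. The rule, the attribute coding, and the sample size are all illustrative.

import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000

# Columns: four learning-style dimensions (5 codes each) and two LO attributes.
X = np.column_stack([
    rng.integers(0, 5, n),   # Input
    rng.integers(0, 5, n),   # Perception
    rng.integers(0, 5, n),   # Understanding
    rng.integers(0, 5, n),   # Processing
    rng.integers(0, 9, n),   # Learning Object Type
    rng.integers(0, 5, n),   # Format
])

# A single invented matching rule stands in for the expert tables:
# visual-leaning students (Input code < 2) find image/video formats appropriate.
y = ((X[:, 0] < 2) & np.isin(X[:, 5], [1, 3])).astype(int)

clf = CategoricalNB()
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold cross-validated accuracy: %.3f" % scores.mean())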
and added to all the corresponding counters of the predicted class and proportionally subtracted from the counters of all the other classes. If an example is correctly classified, then the increment is positive and equal to 1 − P(predicted|X); otherwise, it is negative. Experimental evaluation showed consistent reductions of the error rate. The main idea we propose is to use the student's ratings instead of the categorical class values for the adaptation procedure. We consider different increment values according to the quantitative differences between the observed class and the predicted class. For instance, if a learning object is classified as appropriate with a high probability (5 stars) and the student rates this learning object as 4 stars, then we use an increment with a value greater than the one used when the student rates this resource as 1 star.
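A small, dependency-free sketch of this graded-increment idea is shown below. It keeps per-class pseudo-counts for each (attribute, value) pair, adds a signed increment to the predicted class after each rated example, and subtracts it proportionally from the other classes; the scaling of the increment by the star rating is an invented illustration of the proposal, not the exact update of [24,25].

from collections import defaultdict

class AdaptiveCounts:
    """Per-class (attribute, value) pseudo-counts nudged after each student rating."""

    def __init__(self, classes=("appropriate", "not_appropriate")):
        self.classes = classes
        self.counts = {c: defaultdict(lambda: 1.0) for c in classes}  # Laplace-style start

    def update(self, example, predicted_class, predicted_prob, correct, stars):
        # Base increment follows the text: positive 1 - P(predicted|x) when correct,
        # negative P(predicted|x) otherwise; then scaled by the star rating (illustrative).
        base = (1.0 - predicted_prob) if correct else -predicted_prob
        delta = base * (0.5 + abs(stars - 3) / 4.0)
        others = [c for c in self.classes if c != predicted_class]
        for attribute, value in example.items():
            key = (attribute, value)
            self.counts[predicted_class][key] = max(0.01, self.counts[predicted_class][key] + delta)
            for other in others:
                self.counts[other][key] = max(0.01, self.counts[other][key] - delta / len(others))

model = AdaptiveCounts()
model.update({"format": "video", "input": "visualModerate"},
             predicted_class="appropriate", predicted_prob=0.8, correct=True, stars=5)
print(dict(model.counts["appropriate"]))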
1. Filtering: When a student logs in, we use some deterministic rules to filter those learning objects that match the student's preferred language as defined in his or her profile.
2. Prediction: When the student selects a topic, we filter the learning objects for that topic and apply a third filter to obtain those that match the student's knowledge level. After that, the current decision model is used to classify each selected object as "appropriate" or "not appropriate" for the student. To do this, examples including the learning style attributes (inferred from the learning style model) and the learning object's characteristics are automatically generated and classified by the decision model. Since we use a probabilistic classifier, the learning objects can easily be ranked before being shown to the student (see the ranking sketch after this list).
3. Adaptation: All the learning objects explain the same topic, so when the student selects a particular learning object we assume that it is interesting to the student, presumably because of its characteristics. To obtain more feedback, we also suggest that the user explicitly rate each learning object. Whenever we obtain new evidence about the student's real preferences for a particular learning object, we use this information to adapt both the learning style model and the decision model.
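The ranking step of the prediction phase can be sketched as follows, assuming some probabilistic classifier (such as the naive Bayes trained in the earlier sketch) whose class 1 means "appropriate". The candidate encoding, the field names, and the helper function are hypothetical.

import numpy as np

def rank_learning_objects(classifier, student_style, candidates):
    """Return candidate LOs sorted by P(appropriate); names and encoding are illustrative."""
    # Each example = the student's four learning-style codes + the LO's attribute codes.
    examples = np.array([student_style + lo["features"] for lo in candidates])
    p_appropriate = classifier.predict_proba(examples)[:, 1]  # assumes class 1 = appropriate
    order = np.argsort(-p_appropriate)
    return [(candidates[i]["id"], float(p_appropriate[i])) for i in order]

# Hypothetical usage with the CategoricalNB sketch above (4 style codes + 2 LO codes):
# ranked = rank_learning_objects(clf, student_style=[1, 3, 2, 0],
#                                candidates=[{"id": "lo-17", "features": [2, 1]},
#                                            {"id": "lo-42", "features": [7, 3]}])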
Student’s
preferred Resource’s
Log-in languages languages
Educational
system Filtering rule
Student
1° Filtered LOR
This subset contains those resources
that match the student’s language.
Topic’s content
Resource’s
Appropriate difficulty level
Knowledge model
Not appropriate
Student’s
knowledge level
Learning style
Classification Filtering rule
model
Student’s
Resources’
learning style
characteristics
Examples generation
Predictive model
This subset contains those resources 3° Filtered LOR
that match the student’s knowledge level.
3°
Student’s
Selects Appropriate
feedback
resource Model adaptation
Not appropriate Update Update
Student
Topic’s content
Predictive model
Learning style
model
Figure 35.4
The selection of learning objects task.
so that their users can make a better use of it. To discover the user’s preferences, we
use the information about learning styles represented in the student’s Bayesian learning
style model. The advantage of using a Bayesian model is that this allows refining the
initial beliefs acquired by the initial test by observing the student’s selections over time,
thus computing up-to-date learning style for each student. On the other hand, we use
an adaptive BN classifier as the decision model to determine whether a given resource
is appropriate for a specific learning style or not. We described the experiments carried
out to obtain an initial model, thus solving the cold-start problem. For each student, we
initialize the decision model from data generated from the matching tables (set of rules
defined by an expert that represents the matches between learning styles and learning
objects). Each individual decision model is then adapted from the observations of the
student’s selections and ranks over time. Moreover, the model is also able to adapt itself
to changes in the student’s preferences. Although the proposed model has not yet been
evaluated using data generated from experiments with real students, we believe that
both BN and adaptive BN classifiers are suitable choices to model learning styles and
students’ preferences in e-learning. Using the proposed models allows dealing with the
uncertainty inherent in the acquisition of this information about a particular user and
also with the unexpected changes of the students’ behavior over time. In future work,
we plan to carry out experiments with simulated students and also in the context of a
real e-learning system with real students in order to evaluate the performance of our
proposal.
References
1. Merlot, http://www.merlot.org. URL last accessed on January 2008.
2. Ariadne, http://www.ariadne-eu.org/. URL last accessed on January 2008.
3. IEEE Learning Technology Standards Committee. IEEE standard for learning object metadata.
IEEE. http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf. URL last accessed
on January 2008.
4. Stern, M.K. and Woolf, B.P. Adaptive content in an online lecture system. Proceedings of the
International Conference on Adaptive Hypermedia and Adaptive Web Based Systems (AH2000),
Trento, Italy, pp. 227–238, 2000.
5. Triantafillou, E., Pomportsis, A., and Demetriadis, S. The design and the formative evaluation
of an adaptive educational system based on cognitive styles. Computers & Education 41 (2003)
87–103.
6. Papanikolaou, K.A., Grigoriadou, M., Kornilakis, H., and Magoulas, G.D. Personalizing the interaction in a Web-based educational hypermedia system: The case of INSPIRE. User Modeling and User-Adapted Interaction 13(3) (2003) 213–267.
7. Wolf, C. iWeaver: Towards learning style-based e-learning in computer science education.
Proceedings of the Fifth Australasian Computing Education Conference (ACE2003), Adelaide,
Australia, pp. 273–279, 2003.
8. Paredes, P. and Rodriguez, P. The application of learning styles in both individual and col-
laborative learning. Proceedings of the Sixth IEEE International Conference on Advanced Learning
Technologies (ICALT’06), Kerkrade, the Netherlands, pp. 1141–1142, 2006.
9. Brown, E., Stewart, C., and Brailsford, T. Adapting for visual and verbal learning styles in
AEH. Proceedings of the Sixth IEEE International Conference on Advanced Learning Technologies
(ICALT’06), Kerkrade, the Netherlands, pp. 1145–1146, 2006.
10. Carver, C.A., Howard, R.A., and Lane, W.D. Enhancing student learning through hypermedia
courseware and incorporation of student learning styles. IEEE Transactions on Education 42(1)
(1999) 33–38.
11. Brusilovsky P. and Millán, E. User models for adaptive hypermedia and adaptive educational
systems. The Adaptive Web: Methods and Strategies of Web Personalization, LNCS 4321 (2007) 3–53.
12. Webb, G., Pazzani, M., and Billsus, D. Machine learning for user modeling. User Modeling and
User-Adapted Interaction 11 (2001) 19–29.
13. Jensen, F.V. and Nielsen T. Bayesian Networks and Decision Graphs. Springer Verlag Inc.,
New York, 2007.
14. Felder, R.M. and Soloman, B.A. Learning styles and strategies, 2003. URL last accessed on
January 2008. http://www.ncsu.edu/felder-public/ILSdir/styles.htm
15. Felder, R.M. and Soloman, B.A. Index of learning style questionnaire (ILSQ). URL last accessed
on January 2008. http://www.engr.ncsu.edu/learningstyles/ilsweb.html
16. Carmona, C., Castillo, G., and Millán, E. Designing a dynamic Bayesian network for mod-
eling student’s learning styles. The Eighth IEEE International Conference on Advanced Learning
Technologies (ICALT 2008), Santander, Spain, pp. 346–350, 2008.
17. Friedman, N., Geiger, D., and Goldszmidt, M. Bayesian network classifiers. Machine Learning
29(2–3) (1997) 131–163.
18. Castillo, G., Gama, J., and Breda, A.M. An adaptive predictive model for student modeling.
Advances in Web-Based Education: Personalized Learning Environments, Chap. IV, London, UK:
Information Science Publishing, 2005.
19. Carmona, C., Castillo, G., and Millán, E. Discovering student preferences in e-learning.
International Workshop on Applying Data Mining in e-Learning (ADML’07), Crete, Greece,
pp. 33–42, 2007.
20. Sahami, M. Learning limited dependence Bayesian classifiers. Proceedings of the Second
International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, AAAI
Press, Menlo Park, CA, pp. 335–338, 1996.
21. García, P., Amandi, A., Schiaffino, S., and Campo, M. Evaluating Bayesian networks’ precision
for detecting students’ learning styles. Computers & Education 49 (2007) 794–808.
22. Dan, Y. and XinMeng, C. Using Bayesian networks to implement adaptivity in mobile learn-
ing. Proceedings of the Second International Conference on Semantics, Knowledge, and Grid (SKG’06),
Guilin, China, p. 97, 2006.
23. Pearl, J. Probabilistic Reasoning in Expert Systems: Networks of Plausible Inference. San Francisco,
CA: Morgan Kaufmann Publishers, Inc., 1988.
24. Gama, J. Iterative Bayes. Discovery Science—Second International Conference, LNAI, Vol. 1721,
Tokyo, Japan, pp. 80–91, 1999.
25. Castillo, G. and Gama, J. Adaptive Bayesian network classifiers. International Journal of Intelligent
Data Analysis 13(1) (2009) 39–59.