
Journal of Scientific Research & Reports

7(7): 494-500, 2015; Article no.JSRR.2015.230


ISSN: 2320-0227

SCIENCEDOMAIN international
www.sciencedomain.org

Implementing Classification Techniques of Data Mining in Creating Model for Predicting Academic Marketing
Sheila A. Abaya1*, Bobby D. Gerardo2 and Bartolome T. Tanguilig3

1Information Technology, Technological Institute of the Philippines; currently a faculty member of the Department of Computer Studies and Systems, University of the East, Caloocan, Philippines.
2Administration and Finance, West Visayas State University, Iloilo City, Philippines.
3Academic Affairs, Dean of CITE and Graduate Programs Department, Technological Institute of the Philippines, Quezon City, Philippines.

*Corresponding author: Email: [email protected]

Authors’ contributions

The work was carried out in collaboration among all authors. Author SAA wrote the first draft of the manuscript. Author BDG suggested the classification techniques suitable for comparison and contributed some literature searches. Author BTT recommended improvements on data scaling. Author SAA managed the experimental procedures and identified the best classification technique for predicting the probable academic market for tertiary education.

Article Information

DOI: 10.9734/JSRR/2015/16940
Editor(s):
(1) Saad Mohamed Saad Darwish, Department of Information Technology, Institute of Graduate Studies and Research (IGSR),
University of Alexandria, Egypt.
(2) James P. Concannon, Associate Professor of Education, Westminster College, Fulton, Missouri, USA.
(3) Luigi Rodino, Professor of Mathematical Analysis, Dipartimento di Matematica, Università di Torino, Italy.
Reviewers:
(1) Anonymous, Stefan cel Mare University Suceava, Romania.
(2) M. Bhanu Sridhar, Dept. of CSE, GVP College of Engg. For Women, Vizag, India.
(3) Anonymous, Tribhuvan University, Nepal.
(4) G. Y. Sheu, Accounting Information Systems, Chang-Jung Christian University, Tainan, Taiwan.
(5) Anonymous, The University of Silesia, Poland.
(6) Anonymous, University of Tampere, Finland.
Complete Peer review History: http://www.sciencedomain.org/review-history.php?iid=1131&id=22&aid=9563

Original Research Article

Received 19th February 2015
Accepted 7th May 2015
Published 3rd June 2015

ABSTRACT

The education domain is one of the business areas with abundant data. Nowadays, most tertiary educational institutions have difficulty identifying the secondary schools that can be considered feeders for enrollment. The data mining technique of classification has been used in

this research to identify the target secondary schools for enrollment more easily. With these techniques, higher educational institutions may reduce marketing costs by filtering which of these secondary schools are enrollment contributors. The techniques of ID3, C4.5, BayesNet and Naïve Bayes were used in this research, implemented in the WEKA 3.6.0 toolkit [1]. Based on the experimental results, C4.5 outperformed ID3, BayesNet and Naïve Bayes in identifying the target secondary schools for enrollment at the tertiary level. The model created can aid education management's decision-making process in terms of student recruitment.

Keywords: C4.5; J48; ID3; BayesNet; Naïve Bayes.

1. INTRODUCTION

Data Mining (DM) has been treated as the forefront of business technologies [2]. With the overwhelming increase in the size of data in every business, patterns can be identified, validated, and used for prediction. DM has several functionalities or tasks [3] that identify what kinds of data patterns can be mined. One of these is the classification technique [4], which can be used to predict the target secondary feeder schools for enrollment in higher education. The methods of ID3, C4.5, BayesNet and Naïve Bayes were used in this research to identify which classifier works best in producing the model that identifies and determines the probable secondary schools that can be considered target schools for enrollment in higher educational institutions.

2. RELATED LITERATURE

Several studies have been conducted to compare different classification techniques. Sharma et al. [5] worked on a comparative analysis of the J48, ID3, ADTree and SimpleCART classification techniques for spam emails. The research focused on analyzing email data to identify whether a message is spam or not. The experiment was done using WEKA, by the WEKA Machine Learning Project of the University of Waikato in New Zealand. There were 4,601 instances with 1,831 spam categories and 58 attributes, of which 57 are continuous and 1 is nominal. The results of the experiment showed that J48 (C4.5) has the highest classification accuracy at 92.7624%, where 4,268 instances were classified correctly and 333 instances were classified otherwise.

Bresfelean's [6] research focused on the application of classification techniques in predicting probable students' choices in continuing their education with post-university studies and their preference for certain fields of study, and on the data mining technique of clustering in grouping students with dissimilar behavior. Based on this paper, J48 (an implementation of C4.5 in WEKA) is known to be the most used WEKA classification algorithm, noted for providing stability in precision, speed and interpretability of results because of its use of a decision tree.

Grossman and Domingos [7] worked on the comparison of the Bayesian Network Classifier (BNC) with other classification algorithms such as C4.5; Naïve Bayes (NB); Tree-Augmented Naïve Bayes (TAN) by Friedman; the original Bayesian network structure search algorithm (HGC) by Heckerman; Maximum Likelihood learners using the MDL score (ML-MDL) and two-parent nodes (ML-2P); and NB-ELR and TAN-ELR, which are NB and TAN with parameters optimized for conditional log likelihood, by Greiner and Zhou (2002). Based on the results, a BNC can be learned by maximizing conditional likelihood and thus provides a better classification probability than the other methods.

Heckerman's [8] technical report on learning Bayesian networks discusses the advantages of using BayesNet in classification and prediction. BayesNet can handle missing data entries; can model causal relationships, which helps in understanding problem domains and predicting the outcome of interventions; is ideally suited for representing prior knowledge; and avoids overfitting of data.

Naenudorn et al.'s [9] research compares the classification techniques of ID3, J48, Naïve Bayes and OneR in predicting the features of students who are likely to undergo the student admission process. The student data set covers 2009-2011, with 6 attributes and 2,148 instances. The results of the experiment identified J48 (the C4.5 implementation in WEKA) as the algorithm that provides the highest-accuracy model and can be used to predict the future enrollment pattern of students willing to enroll in the university.


Adhatrao et al. [10] worked on applying the classification techniques of ID3 and C4.5 in predicting students' performance. Classification techniques were used rather than clustering because the former are suitable for prediction, which is the subject matter of the research, while clustering works on unknown classes that are discovered from the data. A tool was developed using PHP to interpret the decision trees of ID3 and C4.5 after data processing. It was found from the results that both ID3 and C4.5 achieved an accuracy of approximately 75.275%.

Abaya et al. [11] compared the classification algorithms of C4.5 and BayesNet using 1,970 instances with 4 attributes, the final class being defined as "Enrolled" or "DidNotEnroll". A test set of 27 instances was also used to check the accuracy of the model. The algorithms were implemented in WEKA. Based on the experimental results, the accuracy is close to 56% in favor of the C4.5 algorithm in identifying the potential market for enrollment.

3. BACKGROUND KNOWLEDGE

3.1 ID3 and C4.5

ID3 (Iterative Dichotomiser) is a decision tree algorithm developed by Ross Quinlan in the late 1970s and early 1980s; it and its successor C4.5 were originally intended for classification. These methods follow the greedy approach in constructing decision trees: trees are built top-down in a divide-and-conquer manner. ID3 uses information gain in selecting relevant attributes, while C4.5 uses an extension of information gain known as the gain ratio [4].

ID3/C4.5 pseudocode [12]:

    if the set of remaining non-class attributes is empty,
    or if all of the instances in D are in the same class,
        return an empty tree
    else {
        compute the class entropy of each attribute over the dataset D
        let a* be an attribute with minimum class entropy
        create a root node for a tree T; label it with a*
        for each value b of attribute a* {
            let T(a*=b) be the tree computed recursively by ID3
            on input (D|a*=b, A-a*, C), where D|a*=b contains all
            instances of D for which a* has the value b, and A-a*
            consists of all attributes of A except a*
            attach T(a*=b) to the root of T as a subtree
        }
        return the resulting decision tree T
    }
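To make these two selection measures concrete, the following is a minimal, self-contained Java sketch (not the authors' code; the class counts below are hypothetical) that computes the entropy-based information gain used by ID3 and the gain ratio used by C4.5 for a single candidate split:

    public class AttributeSelection {

        // Shannon entropy (base 2) of a class distribution given by counts.
        static double entropy(int[] classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            double h = 0.0;
            for (int c : classCounts) {
                if (c == 0) continue;
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }

        public static void main(String[] args) {
            // Hypothetical split of 100 instances on one attribute: each row
            // holds the {Enrolled, DidNotEnroll} counts for one attribute value.
            int[][] partitions = { {20, 10}, {15, 25}, {5, 25} };
            int[] overall = {40, 60};
            int total = 100;

            double before = entropy(overall);        // class entropy of D
            double after = 0.0, splitInfo = 0.0;
            for (int[] part : partitions) {
                int n = part[0] + part[1];
                double w = (double) n / total;
                after += w * entropy(part);          // weighted child entropy
                splitInfo -= w * (Math.log(w) / Math.log(2));
            }
            double infoGain = before - after;        // ID3's criterion
            double gainRatio = infoGain / splitInfo; // C4.5's criterion
            System.out.printf("gain = %.4f, gain ratio = %.4f%n",
                    infoGain, gainRatio);
        }
    }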
3.2 BayesNet (BN) and NaiveBayes

BN is a graphical model of the probability relationships among a set of variable features [13]. The model is believed to be true but with uncertainties, and is considered a subjective probability [14]. Naïve Bayes is a simple BN classifier that produces a simple structure in which one node serves as the parent node of all other nodes and no other connections are allowed [15].

This technique uses the K2 algorithm by Cooper (1992). In the pseudocode below, πi is the current parent set of node xi, ∅ is the empty set, ∪ is set union, g is the K2 scoring function, u is the maximum number of parents allowed, and Pred(xi) is the set of nodes that precede xi in the node ordering.

K2 pseudocode:

    procedure K2
    for i := 1 to n do
        πi := ∅;
        Pold := g(i, πi);
        OKToProceed := true;
        while OKToProceed and |πi| < u do
            let z be the node in Pred(xi) - πi that maximizes g(i, πi ∪ {z});
            Pnew := g(i, πi ∪ {z});
            if Pnew > Pold then
                Pold := Pnew;
                πi := πi ∪ {z};
            else OKToProceed := false;
        end {while}
        write("Node: ", xi, "; parents of this node: ", πi);
    end {for}
    end {K2}
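Both classifiers are available off the shelf in WEKA. The following is a minimal sketch, assuming the Weka 3.6 Java API (the ARFF file name is hypothetical), of building a BayesNet that uses the K2 structure search alongside a plain NaiveBayes:

    import weka.classifiers.bayes.BayesNet;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BayesSketch {
        public static void main(String[] args) throws Exception {
            // Load the discretized data; the last attribute is the class.
            Instances data = DataSource.read("enrollment.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // BayesNet with the K2 structure search; "-P 1" caps each node
            // at one parent, yielding a Naive Bayes-like structure.
            BayesNet bn = new BayesNet();
            bn.setOptions(Utils.splitOptions(
                "-Q weka.classifiers.bayes.net.search.local.K2 -- -P 1"));
            bn.buildClassifier(data);

            // Plain Naive Bayes: the class node is the sole parent of all attributes.
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(data);
        }
    }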
4. METHODOLOGY

4.1 Data Preparation

The data preparation structure presented in Fig. 1 illustrates how the training data set as well as the test data sets were derived.

Fig. 1. Composition of data preparation


These historical data consist of the demographic information of the students who took the College Entrance Test. This database undergoes transformation, where it is preprocessed and cleaned to achieve reliable results. Preprocessing and cleaning happen when the continuous values of the historical file are converted into discrete values. When these data are transformed, the attributes [11] are identified as the student's General Weighted Average (Average), the parents' income bracket (Salary), the school's distance (Distance), the school ownership (Ownership), and Class. These attributes define the segmentation of the university's target market. Table 1 defines the attributes used in the data set, while Table 2 defines the attribute values for Average, Distance, Ownership, Salary and Class.

Table 1. Definition of relevant attributes

Attribute    Definition
Average      The average grade of the student before entering the higher education institution
Distance     The proximity of the (target) secondary schools to the tertiary institution
Ownership    The type of management the school has
Salary       The salary range of the parents of the prospective secondary students
Class        Whether prospective students will "Enroll" or "WillNotEnroll" in the higher educational institution; it refers to the final predictive value of "Enrolled" or "DidNotEnroll"

Table 2. Attribute values

Attribute    Alias    Values
Average      A        A1{75-79}, A2{80-84}, A3{85-89}, A4{90-94}, A5{95-100}
Distance     D        D1{1-9 KM}, D2{10-20 KM}, D3{>=21 KM}
Ownership    O        PRI{Private}, PUB{Public}
Salary       S        S1{500-61999}, S2{62000-192999}, S3{193000-603999}, S4{604000-9999999}
Class        C        Enrolled or DidNotEnroll
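As an illustration of this discretization step, a minimal Java sketch (hypothetical helper methods, not the authors' preprocessing code) that maps two of the continuous inputs to the bins of Table 2 might look like this:

    public class Discretize {

        // Map a general weighted average (assumed 75-100) to bins A1-A5 of Table 2.
        static String averageBin(double avg) {
            if (avg < 80) return "A1";  // 75-79
            if (avg < 85) return "A2";  // 80-84
            if (avg < 90) return "A3";  // 85-89
            if (avg < 95) return "A4";  // 90-94
            return "A5";                // 95-100
        }

        // Map a school's distance in kilometres to bins D1-D3 of Table 2.
        static String distanceBin(double km) {
            if (km <= 9) return "D1";   // 1-9 KM
            if (km <= 20) return "D2";  // 10-20 KM
            return "D3";                // >= 21 KM
        }

        public static void main(String[] args) {
            System.out.println(averageBin(87.5) + " " + distanceBin(12.0)); // A3 D2
        }
    }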
4.2 Experimental Results

The four classifiers (ID3, J48 as an implementation of C4.5, BayesNet and Naïve Bayes) were run in the WEKA toolkit. The training dataset has 6,409 instances with 5 attributes, while the test dataset has 1,015 instances with 5 attributes. Fig. 2(a) shows the decision tree derived from C4.5, Fig. 2(b) presents the pruned tree interpreted in IF-THEN rules, and Fig. 3 shows the visualized graph of BayesNet, where C represents the Class (Enrolled or DidNotEnroll), A is for Average, S for Salary, D represents Distance, and O is identified as Ownership.

An excerpt of the pruned tree is interpreted in IF-THEN rules as follows:

If Distance = D1 and Ownership = PRI, then 3,216 instances are classified as "DidNotEnroll".

If Distance = D1 and Ownership = PUB and Average = A1 and Salary = S1, then 73 instances are classified as "Enrolled".

If Distance = D2 and Ownership = PRI, then 473 instances are classified as "Enrolled".

If Distance = D3, then 480 instances are classified as "DidNotEnroll".
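Read as code, the excerpt above is a chain of conditionals. A minimal Java sketch (a hypothetical helper, not output generated by WEKA) of the same rules:

    public class PrunedTreeRules {
        static String classify(String distance, String ownership,
                               String average, String salary) {
            if (distance.equals("D1")) {
                if (ownership.equals("PRI")) return "DidNotEnroll";
                if (average.equals("A1") && salary.equals("S1")) return "Enrolled";
            }
            if (distance.equals("D2") && ownership.equals("PRI")) return "Enrolled";
            if (distance.equals("D3")) return "DidNotEnroll";
            return "DidNotEnroll"; // branches not shown in the excerpt (assumption)
        }

        public static void main(String[] args) {
            System.out.println(classify("D2", "PRI", "A2", "S1")); // Enrolled
        }
    }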
Fig. 3 is interpreted as follows: the occurrence of Average (A), Salary (S), Distance (D) and Ownership (O) is caused by the attribute Class (C).

Table 3 presents the accuracy results after running the classifiers in WEKA. Using the training data set test option, ID3 and C4.5 identified 63% Correctly Classified Instances (CCI) and 37% Incorrectly Classified Instances (ICI), while BayesNet and Naïve Bayes identified 62% CCI with 38% ICI. Using the supplied test data set option, ID3 has a CCI of 71%, which is not far from the CCI of C4.5 at 72%, with ICI of 29% and 28% respectively, while BayesNet and Naïve Bayes both identified a CCI of 52% and an ICI of 48%.


Fig. 2(a). An example of a visualized J48 decision tree

Fig. 2(b). An example of a J48 pruned tree

Table 3. Comparison of accuracy results

Classifier      Training dataset         Supplied test dataset
                CCI (%)     ICI (%)      CCI (%)     ICI (%)
ID3             63.25       36.75        71.43       28.57
J48/C4.5        63.05       36.95        72.02       27.98
BayesNet        61.59       38.41        52.32       47.68
NaiveBayes      61.59       38.41        52.32       47.68
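The two test options in Table 3 correspond directly to WEKA's evaluation modes. The following is a minimal sketch, assuming the Weka Java API (file names are hypothetical), that would reproduce the J48 row by evaluating on the training data and on a supplied test set:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluateJ48 {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // 6,409 instances
            Instances test = DataSource.read("test.arff");   // 1,015 instances
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            J48 tree = new J48();          // WEKA's C4.5 implementation
            tree.buildClassifier(train);

            Evaluation onTrain = new Evaluation(train);
            onTrain.evaluateModel(tree, train); // "use training set" option
            Evaluation onTest = new Evaluation(train);
            onTest.evaluateModel(tree, test);   // "supplied test set" option

            System.out.printf("train CCI = %.2f%%, test CCI = %.2f%%%n",
                    onTrain.pctCorrect(), onTest.pctCorrect());
        }
    }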


Fig. 3. An example of a BayesNet graph

5. CONCLUSION / RECOMMENDATION

The relevant attributes that were used in creating a model for predicting the probable secondary school feeders for higher educational institutions are as follows: Average (which defines the general weighted average of a student before entering the university), Salary (which identifies the income bracket of the student's parents), Distance (which defines the proximity of the secondary school to the prospective university/college institution), Ownership (which defines the type of school management, whether it is privately owned or publicly operated), and Class (the final outcome of the prediction, whether the student will enroll or will not enroll in the tertiary institution). Values for these attributes were also presented in the paper and were based on the actual values used in schools. In the results derived after running the four classifiers in the WEKA toolkit, ID3 and C4.5 correctly classified 71% and 72% of instances respectively, whereas BayesNet and Naïve Bayes both correctly classified 52% of instances. Based on the generated results, it is therefore concluded that decision tree classifiers outperformed graph-based classifiers in creating models for prediction. Moreover, the model created using the decision tree classifiers can be used in predicting the qualified secondary schools for academic recruitment.

ACKNOWLEDGMENTS

The author expresses gratitude to the University of the East, Philippines.

DISCLAIMER

This manuscript was presented at the conference "Proceedings of the International MultiConference of Engineers and Computer Scientists 2014", March 12-14, 2014, Vol. I, available at http://www.iaeng.org/publication/IMECS2014/IMECS2014_pp342-345.pdf.

COMPETING INTERESTS

Authors have declared that no competing interests exist.

REFERENCES

1. Wang W. A tutorial in WEKA. Data Mining & Statistics within the Health Services, University of East Anglia; 2010.
2. Witten I, Frank E, Hall M. Data mining: Practical machine learning tools and techniques. 2nd ed. Elsevier Inc; 2005.
3. Calders T, Pechenizkiy M. Introduction to the special section on educational data mining. SIGKDD Explorations. 2012;13(2).
4. Han J, Kamber M. Data mining: Concepts and techniques. 2nd ed; 2006.
5. Sharma AK, Sahni S. A comparative study of classification algorithms for spam email data analysis. International Journal of Computer Science and Engineering. 2011;3(5):1890-1895.
6. Bresfelean VP. Data mining applications in higher education and academic intelligence management. Theory and Novel Applications of Machine Learning, I-Tech, Vienna, Austria; 2009.
7. Grossman D, Domingos P. Learning Bayesian network classifiers by maximizing conditional likelihood. In: Proc. 21st International Conference on Machine Learning, Banff, Canada; 2004.
8. Heckerman D. A tutorial on learning Bayesian networks. Microsoft Research Advanced Technology Division, Microsoft Corporation, Redmond, WA 98052; 1995.
9. Naenudorn E, et al. Classification model induction for student recruiting. Latest Advances in Educational Technologies. 2012;117-122.
10. Adhatrao K, et al. Predicting students' performance using ID3 and C4.5 classification algorithms. International Journal of Data Mining & Knowledge Management Process. 2013;3(5).
11. Abaya S, et al. Comparison of classification techniques in education marketing. Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. 1, IMECS 2014; 2014.
12. Available: http://www.cs.bc.edu/~alvarez/ML/id3.html (Retrieved March 9, 2015)
13. Al-Nabi D, Ahmed S. Survey on classification algorithms for data mining: (Comparison and evaluation). Computer Engineering and Intelligent Systems. 2013;4(8).
14. Ruggeri F, Faltin F, Kenett R. Bayesian networks. Encyclopedia of Statistics in Quality & Reliability. Wiley and Sons; 2007.
15. Cheng J, Greiner R. Comparing Bayesian network classifiers. Department of Computing Science, University of Alberta, Canada. Available: http://arxiv.org/ftp/arxiv/papers/1301/1301.6684.pdf (Retrieved March 10, 2015)
_______________________________________________________________________________
© 2015 Abaya et al.; This is an Open Access article distributed under the terms of the Creative Commons Attribution
License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.

Peer-review history:
The peer review history for this paper can be accessed here:
http://www.sciencedomain.org/review-history.php?iid=1131&id=22&aid=9563
