A Conceptual Overview of Data Mining: B.N. Lakshmi., G.H. Raghunandhan
Kongu Engineering College, Perundurai, Erode, Tamilnadu, India. 17 & 18 February, 2011. pp. 27-32.
Abstract—Data mining, the non-trivial extraction of novel, implicit, and actionable knowledge from large data sets, is an evolving technology that is a direct result of the increasing use of computer databases to store and retrieve information effectively. It is also known as Knowledge Discovery in Databases (KDD) and enables data exploration, data analysis, and data visualization of huge databases at a high level of abstraction, without a specific hypothesis in mind. Data mining works through a method called modeling, in which a model is built and then used to make predictions. Data mining techniques are the result of a long process of research and product development and include artificial neural networks, decision trees and genetic algorithms. This paper surveys data mining technology: its definition, motivation, process and architecture, the kinds of data mined, its functionalities and classification, major issues, applications and directions for further research.

Keywords—data mining, KDD, recent techniques, modeling, further research.

statisticians, database researchers and business communities. Data mining software does not just change the presentation of data, but discovers previously unknown relationships among the data. The information on which the data mining process operates is contained in a historical database of previous interactions. In principle, data mining is not specific to one type of media or data and should be applicable to any kind of information repository. Some kinds of information that are collected are as follows:

A. Business transactions
Every transaction in the business industry is (often) “memorized” for perpetuity. Such transactions are usually time related and can be inter-business or intra-business operations. Effective use of the data in a reasonable time frame for competitive decision making is the most important problem to solve for businesses that struggle to survive in a highly competitive world.
1) Database, data warehouse, or other information repository: This component is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

2) Database or data warehouse server: This component is responsible for fetching the relevant data, based on the data mining request of the user.

3) Knowledge base: This is the domain knowledge that is used to guide the search, or to evaluate the interestingness of resulting patterns. It includes concept hierarchies that are used to organize attributes or attribute values into different levels of abstraction.

4) Data mining engine: This is an essential component of the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, and evolution and deviation analysis.

5) Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It can access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. Efficient data mining is possible by pushing the evaluation of pattern interestingness deep into the mining process so as to confine the search to only the interesting patterns.

6) Graphical user interface: This module communicates between users and the data mining system, and allows the

Fig 1: Architecture of a typical data mining system.

B. The Process of Data Mining
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

1) Data cleaning: Also termed data cleansing, this is a phase in which noisy data and irrelevant data are removed from the collection.

2) Data integration: In this stage, multiple heterogeneous data sources can be combined into a common source.

3) Data selection: In this phase, the data relevant to the analysis is decided on and retrieved from the data collection.

4) Data transformation: Also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.

5) Data mining: This is a crucial phase in which skillful techniques are applied to extract potentially useful patterns.

6) Pattern evaluation: In this phase, strictly interesting patterns representing knowledge are identified with respect to the given measures.

7) Knowledge representation: This is the final phase, in which the discovered knowledge is visually represented to the user. This is an essential step which uses visualization techniques to help users understand and interpret the data mining results.
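The steps of the knowledge discovery process listed above can be sketched as a chain of small functions. This is a minimal illustration only, not the implementation of any particular system: the function names, the record layout (a list of dictionaries), the frequency count used as the "mining" step, and the `min_count` threshold are all assumptions made for this example.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that contain missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: normalise text values to lower case."""
    return [{k: v.lower() if isinstance(v, str) else v for k, v in r.items()}
            for r in records]

def mine(records, field):
    """Data mining (hypothetical stand-in): count each value of `field`."""
    return Counter(r[field] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns meeting an assumed threshold."""
    return {value: n for value, n in patterns.items() if n >= min_count}

def present(patterns):
    """Knowledge representation: render the patterns as readable lines."""
    return [f"{value}: {n}" for value, n in sorted(patterns.items())]

def run_kdd(sources, fields, field, min_count):
    """Run the steps in the order given in the text."""
    records = integrate(*(clean(s) for s in sources))
    records = transform(select(records, fields))
    return present(evaluate(mine(records, field), min_count))

# Two hypothetical sources with inconsistent casing and a missing value.
store_a = [{"item": "Milk", "qty": 2}, {"item": None, "qty": 1}]
store_b = [{"item": "milk", "qty": 1}, {"item": "Bread", "qty": 3}]
report = run_kdd([store_a, store_b], ["item"], "item", min_count=2)
print(report)
```

Because each step is a separate function, any step can be rerun with different settings, which mirrors the iterative nature of the process.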
Proceedings of the National Conference on Innovations in Emerging Technology-2011.
Fig 2: Data mining as a process of knowledge discovery.

The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

The types of data mined are as follows:

1) Flat files: Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, or scientific measurements.

2) Relational Databases: Briefly, a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples.

3) Data Warehouses: A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof.

4) Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Associated with the transaction files could also be descriptive data for the items.

5) Multimedia Databases: Multimedia databases include video, images, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system.

6) Spatial Databases: Spatial databases are databases that store geographical information like maps, and global or regional positioning, in addition to usual data.

7) World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. Data

A. Characterization
Data characterization is a summarization of general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

B. Discrimination
Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class.

C. Association analysis
Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis.

D. Classification
Classification analysis is the organization of data in given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects.

E. Prediction
There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction more often refers to the forecast of missing numerical values, or increase/decrease trends in
time related data. The major idea is to use a large number of past values to consider probable future values.

systems offer several data mining functionalities together and tend to be comprehensive systems.
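The idea of using past values to estimate a probable future value, and to tell whether a series is increasing or decreasing, can be illustrated with a least-squares straight-line fit. This is a swapped-in technique chosen for brevity, not a method named in the paper; the function names and the sample series are assumptions made for this sketch.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * t by least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two past values")
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    s_tt = sum((t - mean_t) ** 2 for t in range(n))
    slope = s_ty / s_tt
    intercept = mean_y - slope * mean_t
    return slope, intercept

def forecast_next(values):
    """Extrapolate the fitted line one step past the last observation."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * len(values)

def trend_direction(values):
    """Report whether the fitted trend is increasing, decreasing or flat."""
    slope, _ = fit_linear_trend(values)
    if slope > 0:
        return "increase"
    if slope < 0:
        return "decrease"
    return "flat"

# Hypothetical time-related data: four past observations.
past_values = [10, 12, 14, 16]
print(forecast_next(past_values), trend_direction(past_values))
```

A straight line is only appropriate when the series is roughly linear; real time-related data usually calls for a model chosen after inspecting the data.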
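For the association analysis functionality described earlier, the two thresholds can be made concrete. Support of an item set is the fraction of transactions that contain it, and confidence of a rule X → Y is support(X ∪ Y) divided by support(X). The sketch below computes both for a handful of made-up market baskets; the function names and the sample data are assumptions for illustration, and no frequent item set search algorithm is implemented.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated conditional probability of `consequent` given `antecedent`."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical market baskets, each a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 2 of 4 baskets
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule would be reported only if both values meet user-chosen minimum thresholds, which is how the two measures pinpoint association rules.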
new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the competitive advantage attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

B. Surveillance / Mass surveillance
Surveillance is the monitoring of the behavior, activities, or other changing information, usually of people and often in a surreptitious manner. It most usually refers to observation of individuals or groups by government organizations, but disease surveillance, for example, is monitoring the progress of a disease in a community.

C. National Security Agency
have given a brief explanation about the typical architecture
of data mining and explained the steps of the data mining
process. This paper abstracts the functionalities of data
mining and describes the classification of data mining
systems. It highlights the benefits that data
mining technology offers to the present-day world. We also
discuss the major issues that need to be addressed and
mention a few applications wherein data mining technology
can be applied. Therefore, from a strategic perspective, the
need to navigate the rapidly growing universe of digital data
will rely heavily on the ability to effectively manage and
mine the raw data.