Proceedings of the National Conference on Innovations in Emerging Technology-2011

Kongu Engineering College, Perundurai, Erode, Tamilnadu, India. 17 & 18 February, 2011. pp. 27-32.

A Conceptual Overview of Data Mining


B.N. Lakshmi #1, G.H. Raghunandhan #2
#1,2 Department of Computer Science Engineering, Reva Institute of Technology and Management,
Bangalore, Karnataka, India
1 [email protected]

Abstract—Data mining, the non-trivial extraction of novel, implicit, and actionable knowledge from large data sets, is an evolving technology which is a direct result of the increasing use of computer databases to store and retrieve information effectively. It is also known as Knowledge Discovery in Databases (KDD) and enables data exploration, data analysis, and data visualization of huge databases at a high level of abstraction, without a specific hypothesis in mind. Data mining works through a method called modeling, in which a model is built and then used to make predictions. Data mining techniques are the result of a long process of research and product development and include artificial neural networks, decision trees and genetic algorithms. This paper surveys data mining technology: its definition, motivation, process and architecture, the kinds of data mined, the functionalities and classification of data mining, major issues, applications and directions for further research.

Keywords—data mining, KDD, recent techniques, modeling, further research.

I. INTRODUCTION

With the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. These massive collections of data stored on disparate structures very rapidly became overwhelming and led to the creation of structured databases and database management systems (DBMS). Database management systems efficiently manage a large corpus of data and provide effective and efficient retrieval of particular information from a large collection whenever needed. They have also contributed to the recent massive gathering of all sorts of information. This retrieval of data as and when needed contributes to the technology of data mining. Data mining can be viewed as a result of the natural evolution of information technology. This evolution has produced a wide availability of huge amounts of data and an imminent need to turn such data into useful information and knowledge.

Data mining is the extraction of interesting patterns or knowledge from huge amounts of data. It is known by different names such as knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence and others. The term “data mining” refers to the analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data, and it is primarily used by statisticians, database researchers and business communities. Data mining software does not just change the presentation, but discovers previously unknown relationships among the data. The information on which the data mining process operates is contained in a historical database of previous interactions. In principle, data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. Some kinds of information that are collected are as follows:

A. Business transactions
Every transaction in the business industry is (often) “memorized” for perpetuity. Such transactions are usually time related and can be inter-business or intra-business operations. Effective use of the data in a reasonable time frame for competitive decision making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.

B. Scientific data
Our society is amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, we can capture and store more new data faster than we can analyze the old data already accumulated.

C. Medical and personal data
From government census to personnel and customer files, very large collections of information are continuously gathered about individuals and groups. This type of data is often revealing, yet the information is collected, used and even shared. When correlated with other data, this information can shed light on customer behavior.

D. Surveillance video and pictures
Video tapes are stored and digitized for future use and analysis.

E. Satellite sensing
There are countless satellites around the globe, and all send a non-stop stream of data to the surface. Many satellite pictures and data are made public as soon as they are received, in the hope that other researchers can analyze them.

F. Text reports and memos (e-mail messages)
Most communications are based on reports and memos in textual form, often exchanged by e-mail.

978-1-61284-810-5/11/$26.00 ©2011 IEEE


These messages are regularly stored in digital form for future use and reference, creating formidable digital libraries.

G. The World Wide Web repositories
On the World Wide Web, documents of all sorts of formats, content and description are collected and inter-connected with hyperlinks, making it the largest repository of data. Despite its dynamic and unstructured nature, its heterogeneous character, and its frequent redundancy and inconsistency, the World Wide Web is the most important data collection regularly used for reference because of the broad variety of topics covered and the virtually unlimited contributions of resources and publishers.

II. ARCHITECTURE AND PROCESS OF DATA MINING

A. The Architecture of Data Mining

The architecture of a typical data mining system has the following major components:

1) Database, data warehouse, or other information repository: This component is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

2) Database or data warehouse server: This component is responsible for fetching the relevant data, based on the data mining request of the user.

3) Knowledge base: This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. It includes concept hierarchies that are used to organize attributes or attribute values into different levels of abstraction.

4) Data mining engine: This is an essential component of the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, and evolution and deviation analysis.

5) Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It can access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. Efficient data mining is possible by pushing the evaluation of pattern interestingness deeply into the mining process so as to confine the search to only the interesting patterns.

6) Graphical user interface: This module communicates between users and the data mining system, and allows the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. This component also allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

Fig 1: Architecture of a typical data mining system.

B. The Process of Data Mining

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

1) Data cleaning: Also termed data cleansing, this is a phase wherein noisy data and irrelevant data are removed from the collection.

2) Data integration: In this stage, multiple data sources that may be heterogeneous are combined into a common source.

3) Data selection: In this phase, the data relevant to the analysis is decided on and retrieved from the data collection.

4) Data transformation: Also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.

5) Data mining: This is the crucial phase wherein skillful techniques are applied to extract potentially useful patterns.

6) Pattern evaluation: In this phase, strictly interesting patterns representing knowledge are identified with respect to the given measures.

7) Knowledge representation: This is the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
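
The steps of the KDD process listed above can be illustrated with a minimal Python sketch. It is not the implementation of any particular data mining system: the two sources `store_a` and `store_b`, their field names, the toy records, and the 0.5 support threshold are all assumptions made for illustration, and the mining step is reduced to counting pairs of items that occur together in a transaction.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources with different field names (heterogeneous data).
store_a = [
    {"id": 1, "items": "bread,milk"},
    {"id": 2, "items": "bread,butter,milk"},
    {"id": 3, "items": None},  # noisy record: no items recorded
]
store_b = [
    {"txn": 4, "basket": ["milk", "butter"]},
    {"txn": 5, "basket": ["bread", "milk"]},
]


def clean(records, key):
    """Data cleaning: drop records whose item field is missing or empty."""
    return [r for r in records if r.get(key)]


def integrate(a, b):
    """Data integration: map both sources onto one common schema."""
    unified = [{"id": r["id"], "items": r["items"].split(",")} for r in a]
    unified += [{"id": r["txn"], "items": r["basket"]} for r in b]
    return unified


def select(records):
    """Data selection: keep only the field relevant to the analysis."""
    return [r["items"] for r in records]


def transform(baskets):
    """Data transformation: turn each basket into a sorted tuple of unique items."""
    return [tuple(sorted({i.strip().lower() for i in b})) for b in baskets]


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts


def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in counts.items()
        if count / n_baskets >= min_support
    }


def run_pipeline(min_support=0.5):
    """Chain the stages in the order given in the text."""
    unified = integrate(clean(store_a, "items"), clean(store_b, "basket"))
    baskets = transform(select(unified))
    return evaluate(mine(baskets), len(baskets), min_support)


if __name__ == "__main__":
    # Knowledge representation: present the surviving patterns to the user.
    for pair, support in sorted(run_pipeline().items()):
        print(f"{pair[0]} + {pair[1]}: support {support:.2f}")
```

With the toy data above, four transactions survive cleaning, and only the pairs bread/milk (support 0.75) and butter/milk (support 0.50) pass pattern evaluation. Real systems are far more elaborate at every stage; the sketch only shows the order in which the stages feed one another and why the process can be rerun with a different threshold or additional sources.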


Fig 2: Data mining as a process of knowledge discovery.

KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

The types of data mined are as follows:

1) Flat files: Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, or scientific measurements.

2) Relational Databases: Briefly, a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples.

3) Data Warehouses: A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof.

4) Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Associated with the transaction files could also be descriptive data for the items.

5) Multimedia Databases: Multimedia databases include video, images, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system.

6) Spatial Databases: Spatial databases are databases that store geographical information like maps, and global or regional positioning, in addition to usual data.

7) World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Data mining in the World Wide Web, or web mining, is often divided into web content mining, web structure mining and web usage mining.

III. FUNCTIONALITIES AND CLASSIFICATIONS OF DATA MINING

The data mining functionalities and the variety of knowledge they discover are briefly presented in the following:

A. Characterization
Data characterization is a summarization of general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

B. Discrimination
Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class.

C. Association analysis
Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis.

D. Classification
Classification analysis is the organization of data in given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects.

E. Prediction
There are two major types of predictions: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or increase/decrease trends in
time-related data. The major idea is to use a large number of past values to consider probable future values.

F. Clustering
Similar to classification, clustering is the organization of data in classes. In clustering, however, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, as the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).

G. Outlier analysis
Outliers are data elements that cannot be grouped in a given class or cluster. They are also known as exceptions or surprises and are often very important to identify. In some domains outliers can reveal important knowledge, so they can be very significant and their analysis valuable.

H. Evolution and deviation analysis
This pertains to the study of time-related data that change with time. Evolution analysis models evolutionary trends in data, which allows characterizing, comparing, classifying or clustering of time-related data. Deviation analysis considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

Many data mining systems are available or are being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; their classification is as follows:

A. Classification according to the type of data source mined
This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, and World Wide Web data.

B. Classification according to the data model drawn on
This classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional and others.

C. Classification according to the kind of knowledge discovered
This classification categorizes data mining systems based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination, association, classification, clustering and others. Some systems offer several data mining functionalities together and tend to be comprehensive systems.

D. Classification according to mining techniques used
Different techniques are employed and provided in data mining systems. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented approaches and others. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

IV. BENEFITS AND LIABILITIES OF DATA MINING AND ITS APPLICATIONS

There are many benefits that can be obtained through the application of data mining technology. Some of the benefits of data mining are as follows:
• Helps to unearth facts about customers from your database, which you previously didn’t know about, including purchasing behavior.
• Lends automation benefits to existing hardware and software.
• Crediting/Banking: helpful to financial institutions in such areas as loan information and credit reporting.
• Research: makes the process of data analysis faster.
• Law enforcement: can assist law enforcers with identifying criminal suspects and taking them into custody, by looking into trends in various behavior patterns.
• Marketing: helps to predict the products which customers would like to buy.
• Transportation: to evaluate loading patterns.
• Medicine: to discover effective medical therapies for diverse illnesses.
• Insurance: to detect fraudulent behavior.
• Enhances efficiency and saves money.

Data mining is an emerging and ubiquitous trend, and before it develops into a conventional, mature and trusted discipline, many still pending issues have to be addressed. Some of these issues are:

A. Security and social issues
Security is an important issue with any data collection that is shared and is intended to be used for strategic decision-making. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Data mining could disclose
new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the competitive advantage attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

B. User interface issues
The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. The major issues related to user interfaces and visualization are “screen real-estate”, information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

C. Mining methodology issues
These issues pertain to the data mining approaches applied and their limitations. More than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This “curse” affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.

D. Performance issues
Many artificial intelligence and statistical methods exist for data analysis and interpretation, but they were often not designed for the very large data sets data mining deals with. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Other topics in the issue of performance are incremental updating and parallel programming.

E. Data source issues
There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. Heterogeneous data sources, at structural and semantic levels, pose important challenges not only to the database community but also to the data mining community.

Some of the applications of data mining are:

A. Data Mining in Agriculture
Recent technologies are nowadays able to provide a lot of information on agricultural-related activities, which can then be analyzed in order to find important information.

B. Surveillance / Mass surveillance
Surveillance is the monitoring of the behavior, activities, or other changing information, usually of people and often in a surreptitious manner. It most usually refers to observation of individuals or groups by government organizations, but disease surveillance, for example, is monitoring the progress of a disease in a community.

C. National Security Agency
The National Security Agency/Central Security Service (NSA/CSS) is a cryptologic intelligence agency of the United States Department of Defense responsible for the collection and analysis of foreign communications and foreign signals intelligence, as well as protecting U.S. government communications and information systems, which involves cryptanalysis and cryptography.

1) Quantitative structure-activity relationship: This is the process by which chemical structure is quantitatively correlated with a well-defined process, such as biological activity or chemical reactivity.

2) Customer analytics: Customer analytics is a process by which data from customer behavior is used to help make key business decisions via market segmentation and predictive analytics.

3) Police-enforced ANPR in the UK: The UK has an extensive automatic number plate recognition (ANPR) CCTV network. Police and security services use it to track UK vehicle movements in real time. The resulting data are stored for 5 years in the National ANPR Data Centre to be analyzed for intelligence and to be used as evidence.

4) Stellar Wind (code name): Stellar Wind is the open secret code name for certain information collection activities performed by the United States' National Security Agency.

V. CONCLUSION

Data mining is a technique that offers great promise in helping organizations uncover patterns hidden in their data that can be used to predict the behavior of customers, products and processes. However, data mining tools need to be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on the motivation for, or the need for, data mining. We have given a brief explanation of the typical architecture of data mining and explained the steps of the data mining process. This paper summarizes the functionalities of data mining and describes the classification of data mining systems. It throws light on the benefits that data mining technology offers to the present-day world. We also discuss the major issues that need to be addressed and mention a few applications wherein data mining technology can be applied. Therefore, from a strategic perspective, the need to navigate the rapidly growing universe of digital data will rely heavily on the ability to effectively manage and mine the raw data.
