Data Warehousing & Data Mining
BSc. CSIT, 7th Sem
UNIT: 2 INTRODUCTION TO DATA MINING
Motivation For Data Mining
Data mining is defined as the process of extracting information or
knowledge from huge amounts of data.
Data mining is mining knowledge from the data.
The major reason data mining has attracted a great deal of attention
in the information industry in recent years is the wide availability of
huge amounts of data and the imminent need to turn such data into
useful information and knowledge.
Data available in industry is of no use until it is converted into
useful information.
So, it is necessary to analyze huge amounts of data and extract useful
information from it.
The information and knowledge gained can be used for applications
ranging from business management, production control, and market
analysis, to engineering design and science exploration.
Introduction: Data Mining
Data mining is the process of discovering interesting patterns and
knowledge from huge amounts of data.
Data mining is also known as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data
dredging.
Data mining is an essential step in the process of KDD (Knowledge
Discovery from Data).
The information or knowledge extracted by data mining can be used
for applications like:
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
What kinds of data can be mined by data mining?
Data mining can be applied to the following kinds of data:
1. Database data
A database management system (DBMS) consists of a collection of
interrelated data, known as a database, and a set of programs to
access that data.
A relational database is a collection of tables, each of which is
assigned a unique name.
Each table consists of rows and columns. The rows hold a (usually
large) set of tuples (records), and the columns correspond to a set
of attributes (fields).
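The row/column terminology above can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name customer and the two sample rows are hypothetical; the attribute names customer_id, name, and address are borrowed from the customer example used later in these notes.

```python
import sqlite3

# In-memory database holding one hypothetical relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Customer A", "City X"), (2, "Customer B", "City Y")],  # made-up tuples
)

cursor = conn.execute("SELECT * FROM customer ORDER BY customer_id")
attributes = [col[0] for col in cursor.description]  # column names = attributes (fields)
tuples = cursor.fetchall()                           # rows = tuples (records)
conn.close()

print(attributes)
print(tuples)
```

Each element of tuples is one record, and each position inside a record corresponds to one attribute.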
Data to be Mined
2. Data warehouse data
A data warehouse is a repository of information collected from multiple,
heterogeneous sources and placed in a single site.
A data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of data that supports the management
decision-making process.
Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.
A data warehouse is usually modeled by a multidimensional data
structure called a data cube, which allows data to be modeled and viewed
in multiple dimensions.
In a data cube, each dimension corresponds to an attribute or a set of
attributes in the schema, and each cell stores the value of some
aggregated measure.
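As a rough illustration of dimensions, cells, and aggregated measures, the sketch below builds small cuboids from a list of fact rows. The dimension names (city, quarter, item), the sales figures, and the helper build_cuboid are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, item, sales_amount)
SALES = [
    ("CityX", "Q1", "laptop", 120),
    ("CityX", "Q1", "printer", 30),
    ("CityX", "Q2", "laptop", 150),
    ("CityY", "Q1", "laptop", 80),
    ("CityY", "Q2", "printer", 20),
]
DIMENSIONS = ("city", "quarter", "item")

def build_cuboid(rows, keep):
    """Sum the sales measure over the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        cells[key] += row[3]  # each cell stores an aggregated measure
    return dict(cells)

base = build_cuboid(SALES, ("city", "quarter", "item"))
by_city_quarter = build_cuboid(SALES, ("city", "quarter"))  # roll-up: item removed
by_city = build_cuboid(SALES, ("city",))                    # roll-up: quarter removed

print(by_city_quarter)
print(by_city)
```

Dropping a dimension and re-aggregating is one simple way to view the same data at a coarser level.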
3. Transactional data
A transactional database consists of transactions such as customer
purchases, flight bookings, etc.
A transaction typically includes a unique transaction identifier
(TID) and a set of items associated with that transaction.
4. Other kinds of data
a) Time series data
b) Spatial data
c) Multimedia data
d) Web data, etc.
Functionalities of Data Mining
The data mining functionalities are used to specify the
kinds of patterns to be found in data mining tasks.
In general, such tasks can be classified into two
categories: descriptive and predictive.
Descriptive mining tasks characterize properties of the
data in a target data set.
Predictive mining tasks perform induction on the current
data in order to make predictions.
Data mining functionalities and the kinds of patterns they
discover are:
1. Class/Concept Description: Characterization and Discrimination:
Data entries can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include big
spenders and budget spenders.
Data characterization is a summarization of the general characteristics
or features of a target class of data. There are several methods for
effective data characterization:
• Simple data summaries based on statistical measures and plots
• The data cube-based OLAP roll-up operation
The output of data characterization can be presented in various forms,
such as pie charts, bar charts, curves, and multidimensional data cubes.
Data discrimination is a comparison of the general
features of the target class data objects against the general
features of objects from one or multiple contrasting
classes.
The target and contrasting classes can be specified by a
user.
The forms of output presentation are similar to those for
characteristic description, although discrimination
description should include comparative measures that help
to distinguish between the target and contrasting classes.
2. Mining Frequent Patterns, Associations:
Frequent patterns are patterns that occur frequently in data.
There are many kinds of frequent patterns, including frequent itemsets
and frequent subsequences.
A frequent itemset typically refers to a set of items that often
appear together in a transactional data set. For example, milk
and bread are frequently bought together in grocery
stores by many customers.
A frequent subsequence is an ordered pattern that occurs often, such as
the pattern that customers tend to purchase first a laptop, followed by
a digital camera, and then a memory card.
Mining frequent patterns leads to the discovery of interesting
associations.
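A minimal sketch of frequent itemset discovery, assuming a tiny hypothetical transaction set keyed by TID and a minimum support of 0.6. It counts every itemset of size 1 or 2 by brute force; this is not the Apriori algorithm or any other optimized method, only an illustration of what "frequent" means.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: TID -> set of items bought together.
TRANSACTIONS = {
    "T1": {"milk", "bread", "butter"},
    "T2": {"milk", "bread"},
    "T3": {"bread", "eggs"},
    "T4": {"milk", "bread", "eggs"},
    "T5": {"milk", "eggs"},
}

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support >= min_support."""
    n = len(transactions)
    counts = Counter()
    for items in transactions.values():
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

result = frequent_itemsets(TRANSACTIONS, min_support=0.6)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

Here support is the fraction of transactions that contain the itemset; with this data, milk and bread appear together in three of the five transactions.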
3. Classification:
Classification is the process of finding a model (or function) that describes
and distinguishes data classes or concepts for future prediction.
Classification assigns objects to predefined classes.
The model is derived from the analysis of a set of training
data (data objects for which the class labels are known).
It can be represented in forms such as if-then rules, decision trees,
or neural networks.
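To make the idea of deriving a model from labeled training data concrete, here is a minimal sketch that learns a single if-then rule (one threshold on one attribute). The attribute annual_spend, its values, and the labels "big" and "budget" are hypothetical; real classifiers such as decision trees generalize this idea to many attributes and splits.

```python
# Hypothetical training data: (annual_spend, class_label)
TRAINING = [
    (200, "budget"), (350, "budget"), (500, "budget"),
    (900, "big"), (1200, "big"), (1500, "big"),
]

def learn_threshold_rule(samples):
    """Choose the split point that misclassifies the fewest training samples."""
    values = sorted(v for v, _ in samples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((label == "big") != (v > threshold) for v, label in samples)

    return min(candidates, key=errors)

def classify(value, threshold):
    """The learned model as an if-then rule."""
    return "big" if value > threshold else "budget"

threshold = learn_threshold_rule(TRAINING)
print(threshold, classify(1000, threshold), classify(300, threshold))
```

The learned rule reads: IF annual_spend > threshold THEN big spender ELSE budget spender.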
5. Cluster analysis:
Unlike classification, which analyzes class-labeled (training) data sets, clustering
analyzes data objects without consulting class labels.
In many cases, class-labeled data may simply not exist at the beginning.
Clustering identifies similarities between objects and groups them according to
the characteristics they have in common and that differentiate them from other
groups of objects.
Clustering can be used to generate class labels for a group of data.
The objects are clustered on the principle of maximizing intra-class similarity
and minimizing inter-class similarity. Clusters of objects are formed so that
objects within a cluster have high similarity to one another but
are rather dissimilar to objects in other clusters.
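One common way to realize this principle is k-means clustering, which these notes do not name; the sketch below is a one-dimensional version under that assumption. The data points and the initial centers are hypothetical, and similarity is taken to be closeness on the number line.

```python
def kmeans_1d(points, centers, iterations=10):
    """Assign each point to its nearest center, then move centers to cluster means."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

POINTS = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # no class labels are given
centers, clusters = kmeans_1d(POINTS, centers=[1.0, 2.0])
print(centers)
print(clusters)
```

The resulting cluster index can then serve as a generated class label for each object.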
6. Outlier analysis:
Outlier: a data object that does not comply with the general behavior or
model of the data.
Many data mining methods discard outliers as noise or exceptions.
However, in some applications rare events can be more interesting than
the more regularly occurring ones. Outlier analysis is useful in fraud
detection and rare-event analysis.
Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that
are remote from any cluster are considered outliers.
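A minimal sketch of the statistical approach, assuming the data are roughly normally distributed and using a z-score cutoff of 2.0. Both the cutoff and the transaction amounts are hypothetical choices for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

AMOUNTS = [20, 22, 19, 21, 23, 20, 500]  # hypothetical transaction amounts
print(zscore_outliers(AMOUNTS))
```

In a fraud detection setting, the flagged amount would be passed on for closer inspection rather than discarded as noise.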
KDD (Knowledge Discovery in Databases)
Knowledge discovery in databases is the process of discovering useful
knowledge from a collection of data.
Knowledge discovery consists of an iterative sequence of the following
steps:
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation
Data mining is an essential step in the process of KDD.
Stages of KDD
[Figure: Data mining as an essential step in the process of KDD]
Data Cleaning - removes noise and inconsistent data.
Data Integration - combines data from multiple heterogeneous
sources.
Data Selection - retrieves the data relevant to the analysis task
from the database.
Data Transformation - transforms and consolidates data into
forms appropriate for mining, for example by performing summary
or aggregation operations.
Data Mining - an essential process where intelligent
methods are applied to extract data patterns.
Pattern Evaluation - identifies the truly interesting
patterns representing knowledge, based on
interestingness measures.
Knowledge Presentation - uses visualization and
knowledge representation techniques to present
mined knowledge to the user.
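The seven KDD stages can be sketched as a chain of small functions. Everything here is a toy under stated assumptions: the two record sources, the field names (tid, item, qty, clerk), the "mining" step (ranking items by quantity), and the interestingness threshold min_qty are hypothetical.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
SOURCE_A = [
    {"tid": 1, "item": "milk", "qty": 2, "clerk": "x"},
    {"tid": 2, "item": None, "qty": 1, "clerk": "y"},
]
SOURCE_B = [
    {"tid": 3, "item": "milk", "qty": 1, "clerk": "z"},
    {"tid": 4, "item": "bread", "qty": 1, "clerk": "x"},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate quantity per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals):
    """Data mining (toy): rank items by total quantity sold."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Pattern evaluation: keep patterns meeting the interestingness threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_qty]

def present(patterns):
    """Knowledge presentation: format the patterns for the user."""
    return [f"{item}: {qty} units" for item, qty in patterns]

records = integrate(clean(SOURCE_A), clean(SOURCE_B))
records = select(records, ["item", "qty"])
report = present(evaluate(mine(transform(records)), min_qty=2))
print(report)
```

In practice the sequence is iterative: results from evaluation often send the analyst back to selection or transformation.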
Data objects and attribute types
A data object represents an entity. In a sales database, the
objects may be customers, items, and sales.
Data objects are typically described by attributes.
An attribute is a data field representing a characteristic or
feature of a data object.
Attributes describing a customer object can include
customer_id, name, and address.
Data objects can also be referred to as samples, examples,
instances, or objects.
The type of an attribute is determined by the set of possible
values (nominal, binary, ordinal, or numeric) that the attribute
can have.
Types of Attributes
1. Nominal attributes:
The values of a nominal attribute are symbols or names of
things.
Each value represents some kind of category, code or
state.
For example: suppose that hair_color and marital_status
are two attributes describing person objects. In our
application, possible values of hair_color are black,
brown, white. The attribute marital_status can take on the
values single, married, divorced. Both hair_color and
marital_status are nominal attributes.
2. Binary attributes:
A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent and 1
means that it is present.
Example: given an attribute smoker describing a patient object, 1 indicates
that the patient smokes, while 0 indicates that the patient does not.
3. Ordinal attributes:
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Example: suppose that drink_size corresponds to the size of drinks
available at a restaurant. This ordinal attribute has three possible values:
small, medium, and large. The values have a meaningful sequence; however,
we cannot tell from the values how much bigger, say, a large is
than a medium.
4. Numeric attributes:
A numeric attribute is quantitative, i.e., it is a measurable quantity,
represented in integer or real values.
It can be interval-scaled or ratio-scaled.
Interval-scaled attributes are measured on a scale of equal-size units. Their
values can be ranked, and we can compare and quantify the difference between values.
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
If a measurement is ratio-scaled, it has an inherent zero point, so we can
speak of a value as being a multiple (ratio) of another value.
Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time
to run a race).
Note: The Celsius and Fahrenheit scales for temperature have arbitrary zero points and are
therefore not ratio scales. However, zero on the Kelvin scale is absolute zero. This
makes the Kelvin scale a ratio scale.
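A short arithmetic sketch of the interval versus ratio distinction, using two hypothetical temperature readings. The only fact assumed beyond the notes is the standard conversion K = °C + 273.15.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (K = C + 273.15)."""
    return celsius + 273.15

c_low, c_high = 10.0, 20.0  # hypothetical readings in degrees Celsius

# Differences are meaningful on an interval scale.
difference = c_high - c_low

# The ratio of Celsius values is not meaningful, because 0 degrees Celsius
# is an arbitrary zero point.
naive_ratio = c_high / c_low

# On the Kelvin (ratio) scale, the ratio is meaningful.
true_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference, naive_ratio, round(true_ratio, 3))
```

So 20 °C is 10 degrees warmer than 10 °C, but it is only about 3.5% hotter in absolute terms, not "twice as hot".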
5. Discrete versus Continuous Attributes:
A discrete attribute has a finite or countably infinite set of
values, which may or may not be represented as integers.
The attributes hair_color, medical_test, and drink_size each have a
finite number of values and so are discrete.
A set of possible values is countably infinite if it is
infinite but its values can be put in one-to-one
correspondence with the natural numbers.
For example, the attribute customer_id is countably infinite:
the number of customers can grow without bound, but in reality the
actual set of values is countable.
If an attribute is not discrete, it is continuous. Continuous
attributes are typically represented as floating-point variables.
Basic Statistical Description of Data
Basic statistical descriptions can be used to identify
properties of the data and highlight which data values
should be treated as noise or outliers.
The three areas of basic statistical descriptions are:
1. Measures of Central Tendency
- Measure the location of the middle or centre of a data
distribution.
- Measures of central tendency include the mean, median,
mode, and midrange (the midrange of a data set is the
average of the minimum and maximum values).
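The four measures listed above can be computed with Python's standard statistics module. The sample values below are hypothetical (for instance, salaries in thousands) and are used only to show the calculations.

```python
import statistics

# Hypothetical sample, e.g. salaries in thousands.
SALARIES = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(SALARIES)              # arithmetic average
median = statistics.median(SALARIES)          # middle value of the sorted data
modes = statistics.multimode(SALARIES)        # most frequent value(s)
midrange = (min(SALARIES) + max(SALARIES)) / 2  # average of min and max

print(mean, median, modes, midrange)
```

This sample has two modes, so multimode is used instead of mode. Note how the single large value pulls the midrange well above the mean and median.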
2. Dispersion of the Data
- Measures how the data are spread out.
- The common data dispersion measures are: the range,
quartiles, and interquartile range; the five-number
summary (a five-item list comprising the minimum value,
first quartile, median, third quartile, and maximum
value of the data set) and box plots; and the variance and
standard deviation of the data.
- These measures are useful for identifying outliers.
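A minimal sketch of the five-number summary and of outlier identification from it. Two assumptions are not taken from these notes: quartiles are computed as medians of the lower and upper halves (one of several conventions), and values more than 1.5 × IQR beyond the quartiles are flagged, which is a common rule of thumb. The sample data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

SAMPLE = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(SAMPLE)
outliers = iqr_outliers(SAMPLE)
spread = (max(SAMPLE) - min(SAMPLE), statistics.pstdev(SAMPLE))  # range, std. dev.

print(summary, outliers)
```

A box plot is essentially a drawing of this summary, with the flagged values plotted individually.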
3. Graphical Data Presentation
- These are used to visually inspect our data.
- Most statistical or graphical data presentation software
packages include bar charts, pie charts, and line graphs.
- Other popular displays of data summaries and
distributions include quantile plots, quantile–quantile
plots, histograms and scatter plots.
Note: For more details, see the Kamber book, unit 2.2.
Issues in data mining
In data mining, the algorithms used are complex and the data
is not available from a single source, and these factors
create some issues.
The major issues are:
1) Mining Methodology and User Interaction
2) Performance Issues
3) Diverse Data Types Issues
Mining Methodology and User Interaction Issues
a) Mining different kinds of knowledge in databases - Different users may be
interested in different kinds of knowledge. Therefore, data mining needs
to cover a broad range of knowledge discovery tasks.
b) Interactive mining of knowledge at multiple levels of abstraction - The data
mining process should be highly interactive. Thus, it is important to build flexible
user interfaces and an exploratory mining environment that facilitate the user's
interaction with the system. Interactive mining should allow users to dynamically
change the focus of the search and to refine mining requests based on the returned
results.
c) Incorporation of background knowledge - Background knowledge, constraints,
rules, and other information regarding the domain under study should be
incorporated into the knowledge discovery process. Such knowledge can be used for
pattern evaluation as well as to guide the search toward interesting patterns.
d) Data mining query languages and ad hoc data mining - Query languages such as SQL
have played an important role in flexible searching because they allow users to
pose ad hoc queries. Similarly, a data mining query language that lets users
describe ad hoc mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
e) Presentation and visualization of data mining results - How can a data mining
system present data mining results vividly and flexibly so that the discovered knowledge
can be easily understood and directly used by humans?
Once patterns are discovered, they need to be expressed in high-level languages and
visual representations. These representations should be easily understandable.
f) Handling noisy or incomplete data - Data cleaning methods are required to
handle noise and incomplete objects while mining data regularities. Without such
methods, the accuracy of the discovered patterns will be
poor.
g) Pattern evaluation - Not all the patterns generated by the mining process are interesting.
What makes a pattern interesting may vary from user to user. Therefore, techniques
are needed to assess the interestingness of discovered patterns based on subjective
measures.
A discovered pattern may be uninteresting because it either represents common
knowledge or lacks novelty.
Performance Issues
a) Efficiency and scalability of data mining algorithms - In order to
effectively extract information from huge amounts of data in
databases, data mining algorithms must be efficient and scalable. In other
words, the running time of a data mining algorithm must be predictable,
short, and acceptable to applications.
b) Parallel, distributed, and incremental mining algorithms - Factors
such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of parallel
and distributed data mining algorithms. These algorithms divide the
data into partitions, which are processed in parallel.
The results from the partitions are then merged.
In addition, the high cost of some data mining processes and the
incremental nature of input promote incremental data mining, which
incorporates new data updates without having to mine the entire data
from scratch.
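The partition, merge, and incremental-update ideas can be sketched with simple item counting. The partitions, the new batch, and the helper names are hypothetical; the partitions are processed one after another here, although each call to count_items is independent and could in principle run on a separate worker.

```python
from collections import Counter

def count_items(partition):
    """Count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def merge(partial_counts):
    """Merge the per-partition results into one global result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Hypothetical data split into two partitions.
PARTITIONS = [
    [["milk", "bread"], ["milk"]],
    [["bread", "eggs"], ["milk", "bread"]],
]
partial = [count_items(p) for p in PARTITIONS]  # independent per-partition work
total = merge(partial)

# Incremental mining: fold in a new batch without recounting old data.
NEW_BATCH = [["eggs"], ["milk", "eggs"]]
updated = total + count_items(NEW_BATCH)

print(total)
print(updated)
```

This works because counts are additive; mining tasks whose results cannot be combined so simply need more careful merge and update steps.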
Diverse Data Types Issues
a) Handling of relational and complex types of data - The
database may contain complex data objects, multimedia data
objects, spatial data, temporal data, etc. It is unrealistic to
expect one system to mine all these kinds of data.
b) Mining information from heterogeneous databases and
global information systems - The data is available at different
data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.
Applications of Data Mining
Data analysis and decision support:
• Market analysis and management: target marketing, customer relationship
management (CRM), market basket analysis, cross-selling (a sales technique
involving the selling of an additional product or service to an existing
customer), and market segmentation.
• Risk analysis and management: forecasting, customer retention, quality
control, and competitive analysis.
• Fraud detection and detection of unusual patterns (outliers).