Unit 1: Introduction to Data Mining
1. Definition of Data Mining:
• Data Mining is the process of discovering hidden
patterns, correlations, and useful information from
large datasets using statistical, machine learning, and
database techniques. It is a crucial step in the broader
process of Knowledge Discovery in Databases
(KDD).
2. Data Mining Issues:
• Data Quality: Missing, noisy, or inconsistent data can affect
results.
• Scalability: Managing very large datasets requires efficient
algorithms.
• High Dimensionality: Handling data with many attributes can
be computationally intensive.
• Data Privacy and Security: Protecting sensitive information
is critical.
• Interpretability of Results: Making sure mined knowledge is
understandable and actionable.
• Data Integration: Combining data from multiple sources with
different formats.
• Dynamic Data (Changing Data): Dealing with continuously
changing data (stream data).
• Human Interaction: A proper interface is needed between
domain experts and users.
• Overfitting: Overfitting occurs when the generated model fits
the training data set well but does not fit the test data set or
future data.
• Outliers: When a model is derived, some data values do not fit
it. These values differ significantly from the normal values, or
they do not fit in any cluster.
• Multimedia data
• Irrelevant data
• Missing data
• Noisy data
• Application
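As a minimal sketch of two of the issues above (missing data and outliers), the following fills missing values with the mean of the observed values and flags values whose z-score exceeds a threshold. The function names, the sample readings, and the threshold of 2.0 are illustrative assumptions, not part of these notes.

```python
from statistics import mean, pstdev

def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings: one missing value, one extreme value.
readings = [10.0, 11.0, None, 9.0, 10.0, 50.0]
cleaned = impute_missing(readings)
outliers = flag_outliers(cleaned)
print(cleaned)
print(outliers)
```

Note that the extreme value also pulls the mean used for imputation upward, so in practice outliers are often handled before missing values are filled, or a more robust statistic such as the median is used.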
3. Stages of the Data Mining Process (KDD):
The KDD (Knowledge Discovery in Databases) process typically
consists of the following steps:
• Data Selection: Identifying the relevant data from various
sources.
• Data Preprocessing: Cleaning and transforming data to
handle noise, missing values, etc.
• Data Transformation: Converting data into suitable formats
for mining (e.g., normalization).
• Data Mining: Applying techniques to extract patterns from
the data.
• Pattern Evaluation: Identifying truly interesting patterns
using measures of interest.
• Knowledge Representation: Presenting the mined knowledge
in a useful format (e.g., graphs, rules, charts).
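The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, field names, and the cutoff of 0.5 are hypothetical, and the "mining" step is a trivial threshold rather than a real algorithm.

```python
# Hypothetical customer records; "b" has a missing age.
raw_records = [
    {"customer": "a", "age": 25, "spend": 120.0, "region": "north"},
    {"customer": "b", "age": None, "spend": 300.0, "region": "south"},
    {"customer": "c", "age": 40, "spend": 80.0, "region": "north"},
    {"customer": "d", "age": 35, "spend": 500.0, "region": "south"},
]

def select(records, fields):
    """Data Selection: keep only the relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def preprocess(records):
    """Data Preprocessing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, field):
    """Data Transformation: min-max normalize one field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant fields
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, field, cutoff=0.5):
    """Data Mining: a trivial pattern, customers above the cutoff."""
    return [r["customer"] for r in records if r[field] > cutoff]

selected = select(raw_records, ["customer", "age", "spend"])
clean = preprocess(selected)
scaled = transform(clean, "spend")
high_spenders = mine(scaled, "spend")

# Pattern Evaluation: what share of customers matches the pattern?
share = len(high_spenders) / len(scaled)

# Knowledge Representation: present the result as a readable rule.
summary = f"{share:.0%} of customers are high spenders: {high_spenders}"
print(summary)
```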
4. Data Mining Techniques/Tasks:
Data mining tasks fall into two groups:
• Predictive: Classification, Regression, Time series analysis,
Prediction
• Descriptive: Clustering, Summarization, Association Rules,
Sequence Discovery
Predictive: These tasks build a model from the data and use it to
predict future trends or unknown values that may be of interest.
1. Classification: Assigning items to predefined categories (e.g.,
spam detection).
2. Regression: Predicting a continuous value (e.g., house
prices).
3. Time series analysis: Analyzing data points recorded at
specific time intervals to find trends and patterns over time
(e.g., weather records, stock market analysis).
4. Prediction: Prediction can be viewed as a type of
classification in which a future state is predicted. It
discovers the relationship between the dependent variable and
the independent variables.
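Two of the predictive tasks above can be illustrated with a short sketch: a one-variable least-squares fit for regression and a nearest-neighbor rule for classification. The data, feature meanings, and function names are hypothetical, chosen only to mirror the house-price and spam examples.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def nearest_neighbor_label(training, point):
    """Classification: return the label of the closest training example."""
    def squared_distance(item):
        features, _label = item
        return sum((a - b) ** 2 for a, b in zip(features, point))
    return min(training, key=squared_distance)[1]

# Hypothetical house sizes (m^2) and prices (thousands).
sizes = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 70.0 + intercept

# Hypothetical e-mail features: (suspicious word count, known contacts).
emails = [((8, 1), "spam"), ((7, 0), "spam"), ((1, 5), "ham"), ((0, 4), "ham")]
label = nearest_neighbor_label(emails, (6, 1))

print(predicted_price, label)
```

Regression returns a continuous value, while classification returns one of the predefined labels; that difference in output type is the main distinction between the two tasks.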
Descriptive: These tasks analyze the available data to find new,
interesting patterns and information in the data set.
1. Clustering: Cluster analysis groups data points together
according to their characteristics. Clustering can also be used
in outlier detection. (e.g., in the life sciences, grouping genes
with similar characteristics)
2. Association Rules: Association rules find relationships
among data items, such as items that frequently occur together
(e.g., market basket analysis).
3. Summarization: Summarization is also called characterization
or generalization. It extracts or derives representative information
about the database.
4. Sequence Discovery: Sequence discovery is a data mining
technique that discovers statistically relevant patterns in sequential
data (e.g., most people who purchase CD players may be found to
purchase CDs within one week).
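As a rough illustration of two descriptive tasks, the sketch below computes support and confidence for an association rule over a few market baskets, then groups one-dimensional points with a basic k-means loop. The transactions, points, and starting centers are made up for the example, and k-means is only one of many clustering methods.

```python
# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})

def kmeans_1d(points, centers, iterations=10):
    """Basic k-means on 1-D points, starting from the given centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # Move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(points, [0.0, 5.0])

print(rule_support, rule_confidence)
print(centers, groups)
```

A rule such as bread → milk is usually kept only if both its support and its confidence exceed user-chosen minimums.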
5. Knowledge Representation Methods:
After mining, knowledge is represented using:
• Graphical: Traditional graph structures such as bar
charts, pie charts, histograms, and line graphs may be used.
• Geometric: Geometric techniques include the box plot and
scatter diagram.
• Icon-based: Using figures, colors, or other icons
can improve the presentation of the results.
• Pixel-based: With this technique, each data value is shown
as a uniquely colored pixel.
• Hierarchical: These techniques hierarchically divide the
display area (screen) into regions based on data values.
• Hybrid: The preceding approaches can be combined into one
display.
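To make the graphical method concrete, here is a small sketch that renders counts as a text bar chart using only the standard library. The function name, the region counts, and the bar width are assumptions for illustration; a plotting library would normally be used for real charts.

```python
def text_bar_chart(counts, width=20):
    """Return one line per label, with a bar scaled to the largest count."""
    peak = max(counts.values())
    lines = []
    for label, count in counts.items():
        bar = "#" * round(width * count / peak)
        lines.append(f"{label:<10}{bar} {count}")
    return lines

# Hypothetical number of customers per region.
region_counts = {"north": 4, "south": 8, "east": 2}
chart = text_bar_chart(region_counts)
for line in chart:
    print(line)
```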
6. Applications of Data Mining:
Data mining is applied across various fields:
• Business: Customer segmentation, sales forecasting, market
basket analysis.
• Healthcare: Disease prediction, patient monitoring, drug
discovery.
• Finance: Credit scoring, fraud detection, risk management.
• Retail: Inventory management, recommendation systems.
• Telecommunications: Churn prediction, network
optimization.
• E-commerce: Personalized marketing, behavior analysis.
• Education: Student performance analysis, dropout prediction.
• Scientific Research: Pattern discovery in scientific data.