INTRODUCTION TO DATA MINING
Unit-7
SCOPE OF DATA MINING
Data mining refers to the process of discovering patterns, relationships, and
insights from large datasets. It involves various techniques from statistics,
machine learning, and database systems. The key objectives of data mining
include:
• Identifying hidden patterns in data
• Predicting future trends and behaviors
• Enhancing decision-making processes
• Extracting useful knowledge from vast amounts of information
MAJOR APPLICATIONS OF DATA MINING
• Business intelligence and market analysis
• Fraud detection and risk management
• Healthcare analytics
• Scientific discovery and research
• Social media and web analytics
DATA EXPLORATION AND REDUCTION
Sampling
• Sampling is a technique used to select a subset of data for analysis.
• It helps in reducing computational costs and improving efficiency.
• Common sampling methods:
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Cluster sampling
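The sampling methods above are simple to implement. As one sketch, stratified sampling draws the same fraction from every stratum so minority groups are not lost; the `stratified_sample` helper and the customer records below are illustrative, not from the original material:

```python
import random

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from each stratum (simple random sampling within strata)."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(rec[key], []).append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# hypothetical data: an imbalanced customer base (80 retail, 20 corporate)
customers = (
    [{"segment": "retail", "id": i} for i in range(80)]
    + [{"segment": "corporate", "id": i} for i in range(20)]
)
subset = stratified_sample(customers, key="segment", fraction=0.1)
```

A plain 10% simple random sample of the same data could easily miss the corporate segment entirely; the stratified version guarantees both segments appear.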
DATA VISUALIZATION
• Data visualization techniques help in understanding patterns, trends, and
outliers in datasets.
• Popular visualization tools include histograms, scatter plots, box plots, and
heatmaps.
• Effective visualization aids in data preprocessing and model selection.
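Under the hood, a histogram is just per-bin counts of a variable. A minimal sketch of that counting step (the `histogram_counts` helper and the data are made up for illustration; a plotting library such as matplotlib would normally do this for you):

```python
def histogram_counts(values, bins, lo, hi):
    """Count how many values fall into each of `bins` equal-width bins on [lo, hi)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts

data = [1, 2, 2, 3, 5, 7, 8, 8, 8, 9]
counts = histogram_counts(data, bins=5, lo=0, hi=10)
# bins [0,2), [2,4), [4,6), [6,8), [8,10) → counts [1, 3, 1, 1, 4]
```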
DIRTY DATA
• Dirty data refers to incomplete, inconsistent, or incorrect data.
• Common issues:
• Missing values
• Duplicate records
• Outliers
• Data inconsistencies
• Cleaning techniques include imputation, normalization, and transformation.
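Two of the cleaning steps above, imputation of missing values and removal of duplicate records, can be sketched in a few lines (helper names and data are hypothetical; here `None` marks a missing value and mean imputation is used):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def drop_duplicates(records):
    """Keep only the first occurrence of each record, preserving order."""
    seen, out = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out

ages = [25, None, 31, None, 40]          # two missing values
cleaned = impute_mean(ages)              # mean of 25, 31, 40 is 32
rows = [("a", 1), ("b", 2), ("a", 1)]    # one duplicate record
deduped = drop_duplicates(rows)
```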
CLUSTER ANALYSIS
• Cluster analysis is a technique used to group similar data points.
• Methods of clustering:
• K-means clustering
• Hierarchical clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Used in market segmentation, image recognition, and anomaly
detection.
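Of the methods listed, K-means is the simplest to sketch: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its cluster. The naive initialization and the 2-D points below are for illustration only; production libraries such as scikit-learn handle initialization and convergence far more carefully:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means on 2-D points: assign to nearest centroid, then recompute."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]  # keep old centroid if a cluster goes empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# two visually obvious groups near (0,0) and (10,10)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```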
CLASSIFICATION
Intuitive Explanation of Classification
• Classification is a supervised learning technique used to
categorize new data based on previously labeled data.
• It involves assigning labels to data instances.
• Examples: spam detection, medical diagnosis, and sentiment
analysis.
MEASURING CLASSIFICATION PERFORMANCE
• Accuracy
• Precision and Recall
• F1-score
• Confusion matrix
• ROC curve and AUC
ACCURACY
• Definition: The proportion of correctly predicted instances (both true positives
and true negatives) out of the total number of instances.
• Useful when the classes are balanced. Not ideal for imbalanced datasets.
PRECISION AND RECALL
Precision:
• Measures the proportion of correctly predicted positive instances out of all
predicted positive instances.
• Focuses on minimizing false positives.
• Formula: Precision = TP / (TP + FP)
Recall (Sensitivity or True Positive Rate):
• Measures the proportion of correctly predicted positive instances out of all
actual positive instances.
• Focuses on minimizing false negatives.
• Formula: Recall = TP / (TP + FN)
F1-SCORE
• Definition: The harmonic mean of precision and recall. It balances both
metrics.
• Used to balance precision and recall, especially on imbalanced datasets.
• Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
CONFUSION MATRIX
• Definition: A table that summarizes the performance of a classification model
by showing the counts of true positives, true negatives, false positives, and
false negatives.
• Provides a detailed breakdown of model performance, helping to identify
specific types of errors.
• Structure:
Predicted Positive Predicted Negative
Actual Positive TP FN
Actual Negative FP TN
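All of the count-based metrics above fall straight out of the confusion matrix. A sketch with hypothetical counts for a spam filter (the function name and the numbers are illustrative):

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# hypothetical confusion matrix: 40 TP, 10 FN, 5 FP, 45 TN
acc, prec, rec, f1 = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```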
ROC CURVE AND AUC
ROC Curve (Receiver Operating Characteristic Curve):
• Plots the True Positive Rate (Recall) against the False Positive Rate (FPR) at
various threshold settings.
• False Positive Rate (FPR): FPR = FP / (FP + TN)
• Evaluates the trade-off between sensitivity and specificity.
AUC (Area Under the Curve):
• Represents the area under the ROC curve.
• Ranges from 0 to 1, where 1 indicates perfect classification and 0.5 indicates
random guessing.
• AUC is useful for comparing models.
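AUC can also be computed without plotting the curve: it equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (ties counting half). A sketch using that rank-based view, with invented scores:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # count positive-vs-negative comparisons won; ties count 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]  # model's predicted probabilities
labels = [1, 1, 0, 1, 0]            # true classes
# 5 of the 6 positive/negative pairs are ranked correctly → AUC = 5/6
```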
USING TRAINING AND VALIDATION DATA
• Training data is used to build a classification model.
• Validation data is used to tune hyperparameters and compare candidate models.
• Test data is used to evaluate model performance.
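The three-way split above can be sketched as a shuffle followed by slicing; the 60/20/20 proportions used here are a common convention, not a rule:

```python
import random

def split_data(records, train=0.6, val=0.2, seed=0):
    """Shuffle, then slice into train / validation / test partitions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = split_data(list(range(100)))
```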
CLASSIFYING NEW DATA
• Once trained, the classification model is used to predict new, unseen data.
• The model assigns a label to each new data instance based on learned patterns.
CLASSIFICATION TECHNIQUES
K-Nearest Neighbors (K-NN)
• A simple, instance-based learning algorithm.
• Classifies a data point based on the majority class of its k-nearest neighbors.
• Works well for smaller datasets but can be computationally expensive for large datasets.
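A minimal K-NN classifier over 2-D points, following the majority-vote rule described above (the training data is invented for illustration):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# hypothetical training set: two well-separated classes
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
label = knn_predict(train, query=(2, 2), k=3)
```

Note the cost remark above in action: every prediction scans (and here sorts) the whole training set, which is why K-NN gets expensive as data grows.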
Discriminant Analysis
• Used for classifying observations into predefined categories.
• Types:
• Linear Discriminant Analysis (LDA)
• Quadratic Discriminant Analysis (QDA)
CLASSIFICATION TECHNIQUES (CONT.)
Logistic Regression
• A statistical model used for binary classification.
• Estimates the probability of a class using the logistic function.
• Suitable for predicting categorical outcomes (e.g., pass/fail, spam/ham).
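A bare-bones logistic regression on one feature, fitted by gradient descent; the data (hours studied vs. pass/fail), learning rate, and epoch count are all illustrative, and real work would reach for a library such as scikit-learn or statsmodels:

```python
import math

def sigmoid(z):
    """The logistic function, mapping any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit w, b for P(y=1 | x) = sigmoid(w*x + b) by gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# hypothetical data: hours studied vs. pass (1) / fail (0)
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)
```

The fitted model outputs a probability; thresholding it at 0.5 gives the pass/fail prediction.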
Association Rule Mining
• Discovers relationships between variables in large datasets.
• Used in market basket analysis to find items that frequently occur together.
• Key algorithms:
• Apriori Algorithm
• FP-Growth Algorithm
CAUSE AND EFFECT MODELING
• Used to understand causal relationships between variables.
• Techniques include:
• Regression analysis
• Granger causality
• Structural equation modeling
• Applications: economic forecasting, medical research, and policy evaluation.
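Of the techniques listed, regression analysis is the easiest to sketch: ordinary least squares on one predictor. The data below are fabricated and exactly linear, and note that regression on its own only shows association; causal claims need additional assumptions or study design:

```python
def ols_fit(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# hypothetical data: advertising spend vs. sales, lying exactly on y = 2x + 1
spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = ols_fit(spend, sales)
```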