What Is Data Mining?
Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer: the process would have been more appropriately named knowledge
mining, which emphasizes mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods
at the intersection of artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use.
The key properties of data mining are
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases
Architecture of Data Mining
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern's interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is
highly recommended to push the evaluation of pattern interestingness as deep as
possible into the mining process so as to confine the search to only the interesting
patterns.
4. User Interface:
This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing
exploratory data mining based on the intermediate data mining results. In addition,
this component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.
Data Mining Process:
Data Mining is a process of discovering various models, summaries, and derived values from
a given collection of data.
The general experimental procedure adapted to data-mining problems involves the following steps:
1. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a particular application
domain. Hence, domain-specific knowledge and experience are usually necessary
in order to come up with a meaningful problem statement. Unfortunately, many
application studies tend to focus on the data-mining technique at the expense of
a clear problem statement. In this step, a modeler usually specifies a set of
variables for the unknown dependency and, if possible, a general form of this
dependency as an initial hypothesis. There may be several hypotheses
formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain and a data-mining model. In
practice, it usually means a close interaction between the data-mining expert and
the application expert. In successful data-mining applications, this cooperation
does not stop in the initial phase; it continues during the entire data-mining
process.
2. Collect the data
This step is concerned with how the data are generated and collected. In general,
there are two distinct possibilities. The first is when the data-generation process
is under the control of an expert (modeler): this approach is known as a designed
experiment. The second possibility is when the expert cannot influence the data-
generation process: this is known as the observational approach. An
observational setting, namely, random data generation, is assumed in most data-
mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and
implicitly given in the data-collection procedure. It is very important, however, to
understand how data collection affects its theoretical distribution, since such a
priori knowledge can be very useful for modeling and, later, for the final
interpretation of results. Also, it is important to make sure that the data used for
estimating a model and the data used later for testing and applying a model
come from the same, unknown, sampling distribution. If this is not the case, the
estimated model cannot be successfully used in a final application of the results.
Introduction to data mining
Data mining is the process of extracting useful information from
large sets of data. It involves using various techniques from
statistics, machine learning, and database systems to identify
patterns, relationships, and trends in the data. This information
can then be used to make data-driven decisions, solve business
problems, and uncover hidden insights. Applications of data
mining include customer profiling and segmentation, market
basket analysis, anomaly detection, and predictive modeling.
Data mining tools and technologies are widely used in various
industries, including finance, healthcare, retail, and
telecommunications.
In general terms, “Mining” is the process of extraction of some
valuable material from the earth e.g. coal mining, diamond
mining, etc. In the context of computer science, “Data
Mining” can be referred to as knowledge mining from data,
knowledge extraction, data/pattern analysis, data archaeology,
and data dredging. It is basically the process carried out for the
extraction of useful information from a bulk of data or data
warehouses. One can see that the term itself is a little confusing.
In the case of coal or diamond mining, the result of the extraction
process is coal or diamond. But in the case of Data Mining, the
result of the extraction process is not data!! Instead, data mining
results are the patterns and knowledge that we gain at the end of
the extraction process. In that sense, we can think of Data Mining
as a step in the process of Knowledge Discovery or Knowledge
Extraction.
Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery
in Databases” in 1989. However, the term ‘data mining’ became
more popular in the business and press communities. Currently,
Data Mining and Knowledge Discovery are used interchangeably.
Nowadays, data mining is used in almost all places where a large
amount of data is stored and processed. For example, banks
typically use ‘data mining’ to find out their prospective customers
who could be interested in credit cards, personal loans, or
insurance as well. Since banks have the transaction details and
detailed profiles of their customers, they analyze all this data and
try to find out patterns that help them predict that certain
customers could be interested in personal loans, etc.
Main Purpose of Data Mining
Basically, data mining has been integrated with many techniques
from other domains such as statistics, machine learning, pattern
recognition, database and data warehouse systems, information
retrieval, and visualization to gather more information about the
data, to help predict hidden patterns, future trends, and
behaviors, and to allow businesses to make decisions.
Technically, data mining is the computational process of
analyzing data from different perspectives, dimensions, angles
and categorizing/summarizing it into meaningful information.
Data mining can be applied to any type of data, e.g., data
warehouses, transactional databases, relational databases,
multimedia databases, spatial databases, time-series databases,
and the World Wide Web.
Data Mining as a Whole Process
The whole process of Data Mining consists of three main phases:
1. Data Pre-processing – Data cleaning, integration, selection, and
transformation take place
2. Data Extraction – where the actual data mining takes place
3. Data Evaluation and Presentation – Analyzing and presenting
results
Applications of Data Mining
1. Financial Analysis
2. Biological Analysis
3. Scientific Analysis
4. Intrusion Detection
5. Fraud Detection
6. Research Analysis
Benefits of Data Mining
1. Improved decision-making: Data mining can provide valuable
insights that can help organizations make better decisions by
identifying patterns and trends in large data sets.
2. Increased efficiency: Data mining can automate repetitive and
time-consuming tasks, such as data cleaning and preparation,
which can help organizations save time and resources.
3. Enhanced competitiveness: Data mining can help organizations
gain a competitive edge by uncovering new business
opportunities and identifying areas for improvement.
4. Improved customer service: Data mining can help
organizations better understand their customers and tailor
their products and services to meet their needs.
5. Fraud detection: Data mining can be used to identify
fraudulent activities by detecting unusual patterns and
anomalies in data.
6. Predictive modeling: Data mining can be used to build models
that can predict future events and trends, which can be used to
make proactive decisions.
7. New product development: Data mining can be used to identify
new product opportunities by analyzing customer purchase
patterns and preferences.
8. Risk management: Data mining can be used to identify
potential risks by analyzing data on customer behavior, market
conditions, and other factors.
Real-Life Examples of Data Mining
Market Basket Analysis: It is a technique that carefully studies
the purchases made by a customer in a supermarket. The concept is
applied to identify the items that are bought together by a
customer. Say, if a person buys bread, what are the chances that
he/she will also purchase butter? This analysis helps companies
design offers and deals, and it is carried out with the help of
data mining.
Protein Folding: It is the study of biological cells that predicts
protein interactions and functionality within them. Applications
of this research include determining the causes of, and possible
cures for, Alzheimer's, Parkinson's, and cancers caused by protein
misfolding.
Fraud Detection: Nowadays, with cell phones everywhere, we can
use data mining to analyze cell phone activity and compare it
against known patterns of suspicious activity. This can help us
detect calls made on cloned phones. Similarly, with credit cards,
comparing purchases against a customer's historical purchases can
detect activity on stolen cards.
Data mining also has many successful applications, such as
business intelligence, Web search, bioinformatics, health
informatics, finance, digital libraries, and digital governments.
Basic Data Mining Task & Functionalities of Data Mining
Data mining functionalities are used to specify the kinds of trends, patterns, or
correlations to be found in data mining tasks. Broadly, data mining activities can be
divided into two categories:
1] Descriptive Data Mining:
This category of data mining is concerned with finding patterns and relationships in
the data that can provide insight into the underlying structure of the data.
Descriptive data mining is often used to summarize or explore the data, and it can
be used to answer questions such as: What are the most common patterns or
relationships in the data? Are there any clusters or groups of data points that share
common characteristics? What are the outliers in the data, and what do they
represent?
Some common techniques used in descriptive data mining include:
Cluster analysis:
This technique is used to identify groups of data points that share similar
characteristics. Clustering can be used for segmentation, anomaly detection, and
summarization.
Association rule mining:
This technique is used to identify relationships between variables in the data. It can
be used to discover co-occurring events or to identify patterns in transaction data.
Visualization:
This technique is used to represent the data in a visual format that can help users
to identify patterns or trends that may not be apparent in the raw data.
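As a concrete illustration of the cluster analysis technique above, the following is a minimal sketch using scikit-learn's KMeans; the toy customer data and the choice of two clusters are assumptions made only for this example.

import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual_spend, visits_per_month] (illustrative values)
X = np.array([
    [200, 2], [220, 3], [250, 2],        # low-spend customers
    [5200, 12], [4800, 10], [5500, 11],  # high-spend customers
])

# Group the customers into two clusters that share similar characteristics
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels :", kmeans.labels_)
print("Cluster centers:", kmeans.cluster_centers_)

Points that fall into the same cluster can then be summarized together, for example as a "high-spend, frequent-visit" customer segment.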
2] Predictive Data Mining: This category of data mining is concerned with
developing models that can predict future behavior or outcomes based on historical
data. Predictive data mining is often used for classification or regression tasks, and
it can be used to answer questions such as: What is the likelihood that a customer
will churn? What is the expected revenue for a new product launch? What is the
probability of a loan defaulting?
Some common techniques used in predictive data mining include:
Decision trees: This technique is used to create a model that can predict the value
of a target variable based on the values of several input variables. Decision trees
are often used for classification tasks.
Neural networks: This technique is used to create a model that can learn to
recognize patterns in the data. Neural networks are often used for image
recognition, speech recognition, and natural language processing.
Regression analysis: This technique is used to create a model that can predict
the value of a target variable based on the values of several input variables.
Regression analysis is often used for prediction tasks.
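To make the decision tree technique above concrete, here is a minimal classification sketch with scikit-learn; the built-in Iris dataset, the 70/30 split, and the depth limit are arbitrary choices for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small labeled dataset and hold out part of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a shallow decision tree and measure how well it predicts unseen data
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))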
Both descriptive and predictive data mining techniques are important for
gaining insights and making better decisions. Descriptive data mining can be used
to explore the data and identify patterns, while predictive data mining can be used
to make predictions based on those patterns. Together, these techniques can help
organizations to understand their data and make informed decisions based on that
understanding.
Data Mining Functionality:
1. Class/Concept Descriptions: Data can be associated with classes or concepts. It is
often helpful to describe individual classes and concepts in summarized, concise, and
yet accurate terms. Such descriptions are referred to as class/concept descriptions.
Data Characterization: This refers to the summary of the general characteristics or
features of the class under study. The output of data characterization can be
presented in various forms, including pie charts, bar charts, curves, and
multidimensional data cubes.
Example: To study the characteristics of software products whose sales increased
by 10% in the previous year, or to summarize the characteristics of customers who
spend more than $5000 a year at AllElectronics. The result is a general profile of
those customers, such as that they are 40-50 years old, employed, and have
excellent credit ratings.
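As a rough illustration of data characterization, the sketch below profiles the customers in a small pandas DataFrame who spend more than $5000 a year; the data are invented for the example and are not the AllElectronics data mentioned above.

import pandas as pd

customers = pd.DataFrame({
    "age": [45, 28, 52, 41, 33, 47],
    "employed": [True, True, False, True, True, True],
    "credit_rating": ["excellent", "fair", "good", "excellent", "fair", "excellent"],
    "annual_spend": [6200, 900, 1500, 7400, 1100, 5600],
})

# Select the target class (big spenders) and summarize its general features
target_class = customers[customers["annual_spend"] > 5000]
print(target_class.describe(include="all"))
print(target_class["credit_rating"].value_counts(normalize=True))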
Data Discrimination: It compares the general features of the class under study
against the general features of objects from one or multiple contrasting classes.
Example: We may want to compare two groups of customers, those who shop for
computer products regularly and those who rarely shop for such products (less than 3
times a year). The resulting description provides a general comparative profile of
those customers, such as: 80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a university degree, while 60%
of the customers who infrequently buy such products are either seniors or youths
and have no university degree.
2. Mining Frequent Patterns, Associations, and Correlations: Frequent patterns
are nothing but things that are found to be most common in the data. There are
different kinds of frequencies that can be observed in the dataset.
Frequent itemset: This applies to a group of items that are often seen together,
e.g., milk and sugar.
Frequent Subsequence: This refers to the pattern series that often occurs
regularly such as purchasing a phone followed by a back cover.
Frequent Substructure: It refers to the different kinds of data structures such
as trees and graphs that may be combined with the itemset or subsequence.
Association Analysis: The process involves uncovering the relationship between
data and deciding the rules of the association. It is a way of discovering the
relationship between various items.
Example: Suppose we want to know which items are frequently purchased together.
An example of such a rule mined from a transactional database is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%],
where X is a variable representing a customer. A confidence, or certainty, of 50%
means that if a customer buys a computer, there is a 50% chance that she will buy
software as well. A 1% support means that 1% of all the transactions under
analysis show that computer and software are purchased together. This
association rule involves a single attribute or predicate (i.e., buys) that repeats.
Association rules that contain a single predicate are referred to as single-dimensional
association rules. Another rule mined from the same kind of data is
age(X, “20…29”) ∧ income(X, “40K…49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%].
The rule says that 2% of the customers under analysis are 20 to 29 years old with an
income of $40,000 to $49,000 and have purchased a laptop, and that there is a 60%
probability that a customer in this age and income group will purchase a laptop. An
association involving more than one attribute or predicate is referred to as a
multidimensional association rule.
Typically, association rules are discarded as uninteresting if they do not satisfy
both a minimum support threshold and a minimum confidence threshold. Additional
analysis can be performed to uncover interesting statistical correlations between
associated attribute–value pairs.
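The support and confidence figures used in the rules above can be computed directly from a transaction list. The sketch below does this in plain Python for the rule buys(computer) ⇒ buys(software); the five transactions are invented for illustration, and lift is included as one common correlation-style measure.

transactions = [
    {"computer", "software", "mouse"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software"},
    {"printer", "paper"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)   # A and B together
computer = sum(1 for t in transactions if "computer" in t)             # antecedent A
software = sum(1 for t in transactions if "software" in t)             # consequent B

support = both / n                    # P(A and B)
confidence = both / computer          # P(B | A)
lift = confidence / (software / n)    # > 1 suggests positive correlation between A and B
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")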
Correlation Analysis: Correlation is a mathematical technique that can show
whether and how strongly pairs of attributes are related to each other. For
example, taller people tend to have higher weight.
Data Mining Task Primitives
Data mining task primitives refer to the basic building blocks or components that
are used to construct a data mining process. These primitives are used to
represent the most common and fundamental tasks that are performed during the
data mining process. The use of data mining task primitives can provide a modular
and reusable approach, which can improve the performance, efficiency, and
understandability of the data mining process.
The Data Mining Task Primitives are as follows:
1. The set of task relevant data to be mined: It refers to the specific data that is
relevant and necessary for a particular task or analysis being conducted using
data mining techniques. This data may include specific attributes, variables, or
characteristics that are relevant to the task at hand, such as customer
demographics, sales data, or website usage statistics. The data selected for
mining is typically a subset of the overall data available, as not all data may be
necessary or relevant for the task. For example: extracting the database name,
the relevant tables, and the required attributes from the provided input database.
2. Kind of knowledge to be mined: It refers to the type of information or insights
that are being sought through the use of data mining techniques. This describes
the data mining tasks that must be carried out. It includes various tasks such as
classification, clustering, discrimination, characterization, association, and
evolution analysis. For example, it determines the task to be performed on the
relevant data in order to mine useful information such as classification,
clustering, prediction, discrimination, outlier detection, and correlation analysis.
3. Background knowledge to be used in the discovery process: It refers to
any prior information or understanding that is used to guide the data mining
process. This can include domain-specific knowledge, such as industry-specific
terminology, trends, or best practices, as well as knowledge about the data
itself. The use of background knowledge can help to improve the accuracy and
relevance of the insights obtained from the data mining process. For example,
The use of background knowledge such as concept hierarchies, and user
beliefs about relationships in data in order to evaluate and perform more
efficiently.
4. Interestingness measures and thresholds for pattern evaluation: It refers to
the methods and criteria used to evaluate the quality and relevance of the
patterns or insights discovered through data mining. Interestingness measures
are used to quantify the degree to which a pattern is considered to be
interesting or relevant based on certain criteria, such as its frequency,
confidence, or lift. These measures are used to identify patterns that are
meaningful or relevant to the task. Thresholds for pattern evaluation, on the
other hand, are used to set a minimum level of interestingness that a pattern
must meet in order to be considered for further analysis or action.
For example: Evaluating the interestingness and interestingness measures
such as utility, certainty, and novelty for the data and setting an appropriate
threshold value for the pattern evaluation.
5. Representation for visualizing the discovered pattern: It refers to the
methods used to represent the patterns or insights discovered through data
mining in a way that is easy to understand and interpret. Visualization
techniques such as charts, graphs, and maps are commonly used to represent
the data and can help to highlight important trends, patterns, or relationships
within the data. Visualizing the discovered pattern helps to make the insights
obtained from the data mining process more accessible and understandable to
a wider audience, including non-technical stakeholders.
For example: presentation and visualization of the discovered patterns using
various visualization techniques such as bar plots, charts, graphs, tables, etc.
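To tie the five primitives together, the snippet below encodes one possible specification as a plain Python dictionary; the field names and values are hypothetical and are shown only to make the list above concrete.

mining_task = {
    "task_relevant_data": {                       # primitive 1: data to be mined
        "database": "AllElectronics",
        "tables": ["customers", "purchases"],
        "attributes": ["age", "income", "items_bought"],
    },
    "kind_of_knowledge": "association rules",     # primitive 2: task to perform
    "background_knowledge": {                     # primitive 3: e.g., a concept hierarchy
        "age_hierarchy": ["youth", "middle_aged", "senior"],
    },
    "interestingness": {                          # primitive 4: measures and thresholds
        "min_support": 0.01,
        "min_confidence": 0.5,
    },
    "presentation": ["rules_table", "bar_chart"], # primitive 5: output representation
}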
Advantages of Data Mining Task Primitives
The use of data mining task primitives has several advantages, including:
1. Modularity: Data mining task primitives provide a modular approach to data
mining, which allows for flexibility and the ability to easily modify or replace
specific steps in the process.
2. Reusability: Data mining task primitives can be reused across different data
mining projects, which can save time and effort.
3. Standardization: Data mining task primitives provide a standardized approach
to data mining, which can improve the consistency and quality of the data
mining process.
4. Understandability: Data mining task primitives are easy to understand and
communicate, which can improve collaboration and communication among
team members.
5. Improved Performance: Data mining task primitives can improve the
performance of the data mining process by reducing the amount of data that
needs to be processed, and by optimizing the data for specific data mining
algorithms.
6. Flexibility: Data mining task primitives can be combined and repeated in
various ways to achieve the goals of the data mining process, making it more
adaptable to the specific needs of the project.
7. Efficient use of resources: Data mining task primitives can help to make more
efficient use of resources, as they allow specific tasks to be performed with the
right tools, avoiding unnecessary steps and reducing the time and computational
power needed.
Problem Identification
Data mining is used in almost all places where a large amount of data is stored and
processed. Data integration is one of the major tasks of data preprocessing.
Integration of multiple databases or data files into a single, consistent data store
is known as data integration. It is usually performed to create data sets for
machine learning algorithms and to derive statistical information from the data
during mining. We integrate data from various sources such as banking transactions,
invoices, customer records, Twitter, blog postings, image, audio or video data,
electronic data interchange (EDI) files, spreadsheets, and sensor data.
Data mining often requires data integration, the merging of data from multiple data
stores into a coherent store, as in data warehousing. These sources may include
multiple databases, data cubes, or flat files. There are a number of issues to
consider during data integration, such as schema integration and object matching.
So a careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set. This can help improve the accuracy and
speed of the subsequent data mining process. The semantic heterogeneity and
structure of data pose great challenges in data integration. How can we match
schema and objects from different sources? How can equivalent real-world
entities from multiple data sources be matched up? This problem is known as the
entity identification problem.
Data is usually collected from multiple resources into a coherent store and it can be
of different dimensions and datatypes. There are different representations of data
and different scales of data.
Issues in Data Integration:
Data redundancy: Redundant data occurs when we merge data from multiple
databases. If the redundant data is not removed, incorrect results will be
obtained during data analysis. Redundant data occurs due to the following
reasons:
o Object identification: The same attribute or object may have
different names in different databases
o Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
Duplicate data attributes: An attribute may duplicate information already
contained in one or more other attributes.
Irrelevant attributes: Some attributes in the data are not important and they
are not considered while performing the data mining tasks. There is no use in
having such irrelevant attributes in the data. For example, students’ ID is often
irrelevant to the task of predicting students’ GPA
Entity Identification Problem: The problem of matching up equivalent real-world
entities from multiple data sources is known as the entity identification problem.
It occurs during data integration: when data from multiple sources are integrated,
some attributes describe the same entity and become redundant if they are merged
blindly. For example, A.cust-id = B.cust-number, where A and B are two different
database tables, cust-id is an attribute of table A, and cust-number is an
attribute of table B. Although the attributes have different names and there is no
declared relationship between the tables, cust-id and cust-number take the same
values and refer to the same entity. This is an example of the entity
identification problem between relations. Metadata can be used to avoid errors in
such schema integration and to ensure that functional dependencies and referential
constraints in the source system match those in the target system. Resolving the
entity identification problem also helps in detecting and resolving data value
conflicts.
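As a minimal sketch of how the cust-id / cust-number example might be handled in practice, the pandas code below renames one key to match the other before merging; the table contents are assumptions, and the rename step stands in for the metadata-driven schema matching described above.

import pandas as pd

# Table A keys customers by cust_id; table B keys the same entities by cust_number
a = pd.DataFrame({"cust_id": [101, 102, 103], "name": ["Ann", "Bob", "Cara"]})
b = pd.DataFrame({"cust_number": [101, 103], "annual_revenue": [5200, 480]})

# Use the knowledge that A.cust_id = B.cust_number to align the schemas,
# so the same customer is not stored twice under differently named attributes
merged = a.merge(b.rename(columns={"cust_number": "cust_id"}), on="cust_id", how="left")
print(merged)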
Data Mining Matrix
A Data Mining Matrix is a structured way to represent various aspects of data mining
techniques, algorithms, and their applications. It typically consists of rows and columns, where
different elements such as data types, mining tasks, and techniques are mapped.
Detailed Breakdown of a Data Mining Matrix
1. Components of a Data Mining Matrix
A Data Mining Matrix is structured using the following dimensions:
Dimension | Description
Data Types | The type of data being mined (e.g., structured, semi-structured, unstructured, text, multimedia, spatial, temporal, web data, etc.)
Mining Tasks | The goals of data mining, such as classification, clustering, association rule mining, anomaly detection, regression, etc.
Algorithms | The specific techniques used to perform the mining tasks, such as Decision Trees, K-Means, Apriori, DBSCAN, Random Forest, etc.
Evaluation Metrics | Measures used to assess the performance of algorithms, such as accuracy, precision, recall, F1-score, support, confidence, and lift.
Applications | Real-world domains where the techniques are used, such as healthcare, finance, marketing, cybersecurity, recommendation systems, etc.
2. Example Data Mining Matrix
A simplified example of a Data Mining Matrix can be represented as follows:
Data Type | Mining Task | Algorithm | Evaluation Metric | Application
Structured (Relational DB) | Classification | Decision Tree | Accuracy, F1-score | Fraud Detection
Unstructured (Text) | Clustering | K-Means | Silhouette Score | Document Categorization
Transactional Data | Association Rules | Apriori | Support, Confidence | Market Basket Analysis
Time-Series Data | Anomaly Detection | DBSCAN | Precision, Recall | Network Intrusion Detection
Spatial Data | Regression | Random Forest | RMSE (Root Mean Squared Error) | Climate Prediction
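Several of the evaluation metrics named in the matrix can be computed with scikit-learn. The sketch below shows accuracy, precision, recall, and F1-score for a set of made-up binary predictions.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))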
3. Significance of a Data Mining Matrix
Comparison of Techniques: Helps compare different mining techniques
across various domains.
Algorithm Selection: Assists in choosing the best algorithm based on data
type and mining task.
Performance Analysis: Evaluates which algorithms work best for specific use
cases.
Application Mapping: Links data mining techniques to practical applications
in industries.
4. How to Use a Data Mining Matrix in Research & Industry
Problem Identification: Determine the mining task needed for a given
problem.
Data Selection: Identify the type of data available.
Algorithm Choice: Select an appropriate algorithm based on the task and data
type.
Performance Evaluation: Use relevant evaluation metrics to measure
effectiveness.
Implementation & Optimization: Deploy the model in real-world
applications and optimize as needed.
Data Cleaning (pre-processing, feature selection, data reduction, feature
encoding, noise and missing values, etc.)
Data Cleaning in Data Science
Data cleaning (or data preprocessing) is a crucial step in the data science pipeline. It ensures that the
dataset is accurate, consistent, and suitable for analysis or machine learning models. Below are the
key steps involved in data cleaning:
1. Data Preprocessing
This involves handling raw data to make it usable. It includes:
Removing duplicates: Identifying and eliminating redundant records.
Handling inconsistent formatting: Standardizing data formats (e.g., date formats,
case sensitivity).
Fixing structural errors: Correcting typos, mislabeling, or incorrect values.
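A minimal preprocessing sketch with pandas follows; the tiny DataFrame is an assumption, and the steps mirror the points above: fixing inconsistent casing, standardizing the date column, and removing the duplicate row that results.

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "alice", "Bob"],
    "signup_date": ["2023-01-05", "2023-01-05", "2023-02-10"],
})

df["name"] = df["name"].str.title()                     # fix inconsistent casing
df["signup_date"] = pd.to_datetime(df["signup_date"])   # standardize the date format
df = df.drop_duplicates()                               # remove the now-identical record
print(df)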
2. Feature Selection
Feature selection helps improve model performance by reducing irrelevant or redundant
features. Techniques include:
Filter Methods: Using statistical tests like correlation or chi-square to remove
irrelevant features.
Wrapper Methods: Using algorithms like Recursive Feature Elimination (RFE) to
iteratively select features.
Embedded Methods: Algorithms like LASSO and Decision Trees perform feature
selection internally.
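As one example of a filter method, the sketch below keeps the two highest-scoring features of a built-in dataset using a chi-square test; the choice of k=2 is arbitrary.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 most relevant features
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)              # (150, 4)
print("Reduced shape :", X_selected.shape)     # (150, 2)
print("Chi-square scores:", selector.scores_)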
3. Data Reduction
Reducing the size of the dataset while retaining essential information:
Dimensionality Reduction: Techniques like PCA (Principal Component Analysis)
and t-SNE help reduce features.
Sampling: Using a subset of data to maintain performance while reducing
computational load.
Aggregation: Combining data points to simplify datasets.
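A minimal dimensionality-reduction sketch with PCA follows; the number of retained components is an arbitrary assumption.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)                      # project 4 features onto 2 components
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)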
4. Feature Encoding
Converting categorical data into numerical values for machine learning models:
One-Hot Encoding: Converts categorical variables into binary columns.
Label Encoding: Assigns numerical values to categorical variables (e.g., “Low” → 0,
“Medium” → 1, “High” → 2).
Ordinal Encoding: Used when categorical data has a meaningful order.
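The sketch below illustrates one-hot encoding for a nominal column and an explicit ordered mapping for an ordinal column using pandas; the columns and categories are invented.

import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi"],     # nominal: no natural order
    "risk": ["Low", "High", "Medium"],        # ordinal: Low < Medium < High
})

df = pd.get_dummies(df, columns=["city"])                        # one-hot encoding
df["risk"] = df["risk"].map({"Low": 0, "Medium": 1, "High": 2})  # ordinal encoding
print(df)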
5. Handling Noise and Missing Values
Noise Removal:
o Smoothing techniques (e.g., binning, moving averages).
o Outlier detection (e.g., IQR method, Z-score, DBSCAN clustering).
Handling Missing Values:
o Deletion: Removing rows or columns with missing values (not always
recommended).
o Imputation: Replacing missing values with mean, median, mode, or using
predictive models (e.g., KNN imputation).
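A small sketch covering both points follows: mean imputation for a missing value and a simple Z-score rule for flagging a noisy outlier. The income values and the threshold of 2 are illustrative assumptions.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [48000, 52000, None, 51000, 49000, 50000, 53000, 250000]})

# Imputation: replace the missing value with the column mean
df["income"] = df["income"].fillna(df["income"].mean())

# Noise/outlier flagging: mark values more than 2 standard deviations from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
df["is_outlier"] = z.abs() > 2
print(df)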
Key Issues and Opportunities in Data Mining
Key Issues in Data Mining:
1. Data Quality & Preprocessing:
o Handling missing, noisy, or inconsistent data
o Data integration from multiple sources
o Data transformation and normalization
2. Scalability & Performance:
o Processing large datasets efficiently
o High computational cost of algorithms
o Optimizing storage and retrieval mechanisms
3. Privacy & Security:
o Ensuring data confidentiality and compliance (GDPR, HIPAA)
o Protecting sensitive user information
o Ethical concerns regarding data collection and usage
4. Algorithm Selection & Optimization:
o Choosing appropriate algorithms for specific problems
o Balancing accuracy, interpretability, and complexity
o Handling high-dimensional and unstructured data
5. Interpretability & Explainability:
o Making AI and ML models understandable to users
o Explaining patterns and predictions in an actionable way
o Reducing bias and ensuring fairness in decision-making
6. Real-time Processing & Streaming Data:
o Handling dynamic and continuously changing data
o Ensuring timely analysis for decision-making
o Managing high-velocity data streams (e.g., IoT, social media)
Opportunities in Data Mining:
1. Business Intelligence & Market Analysis:
o Customer segmentation, sentiment analysis, and churn prediction
o Enhancing targeted marketing and recommendation systems
o Fraud detection and risk management
2. Healthcare & Bioinformatics:
o Disease prediction and medical diagnosis
o Personalized medicine and drug discovery
o Medical imaging analysis and patient care optimization
3. Cybersecurity & Fraud Detection:
o Identifying security breaches and anomalies
o Fraud detection in banking, e-commerce, and insurance
o Threat intelligence and intrusion detection
4. Smart Cities & IoT Applications:
o Traffic pattern analysis and optimization
o Predictive maintenance in smart infrastructure
o Energy consumption forecasting and smart grids
5. Education & E-Learning:
o Adaptive learning and personalized education
o Student performance prediction and dropout analysis
o Automated grading and feedback systems
6. Social Media & Sentiment Analysis:
o Opinion mining and trend detection
o Fake news detection and misinformation analysis
o Enhancing customer engagement and brand reputation
7. E-commerce & Recommendation Systems:
o Personalized product recommendations
o Dynamic pricing and demand forecasting
o Inventory management and supply chain optimization
8. Agriculture & Environmental Monitoring:
o Precision farming using satellite and sensor data
o Climate change analysis and disaster prediction
o Water resource management and pollution control