What is Data Mining?
Data mining is the process of uncovering trends, common themes, or patterns in “big data”.
Knowledge from data mining can help companies and governments cut costs or increase
revenue. For example, an early form of data mining was used by companies to analyze huge
amounts of scanner data from supermarkets. This analysis revealed when people were most
likely to shop, and when they were most likely to buy certain products, like wine or baby
products. This enabled the retailer to maximize revenue by ensuring they always had
enough product at the right time in the right place.
As well as gathering data on “What people watch, listen to, or buy”, modern mining
techniques are used to find answers to a wide variety of questions such as:
Which transactions are more likely to be fraudulent?
Who is a “typical” customer?
What mammograms should be flagged as “abnormal”?
Although data mining is most commonly used by large businesses and organizations, any kind of
data, from any type of database, can be mined.
Data mining is defined as the procedure of extracting information from huge sets of data; in
other words, it is the process of mining knowledge from data. The information or knowledge
extracted in this way can be used for any of the following applications –
1. Market Analysis
2. Fraud Detection
3. Customer Retention
4. Production Control
5. Science Exploration
Who uses Data Mining?
Data mining is primarily used by industries that cater to consumers, such as retail,
financial, and marketing companies. If you’ve ever shopped at a retail store and received
customized coupons, that’s a result of mining. Your individual purchase history was analyzed
to find out what products you’ve been buying and what promotions you’re likely to be
interested in. Netflix uses data mining to recommend movies to its customers, Google uses
data mining to tailor advertisements to internet users, and Walmart uses data mining to manage
inventory and identify areas where new products are likely to be successful. Data mining is more
likely to be used by larger companies, since substantial computing resources are needed to sift
through the data.
Data Mining Applications
Data mining is highly useful in the following domains −
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer
retention, science exploration, sports, astronomy, and Web usage mining.
Market Analysis and Management
Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of people buy what
kind of products.
Identifying Customer Requirements − Data mining helps identify the best
products for different customers and uses prediction to find the factors that may
attract new customers.
Cross-Market Analysis − Data mining finds associations and correlations between
the sales of different products.
Target Marketing − Data mining helps find clusters of model customers who
share the same characteristics, such as interests, spending habits, and income.
Determining Customer Purchasing Patterns − Data mining helps determine
customers' purchasing patterns.
Providing Summary Information − Data mining provides various
multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector −
Finance Planning and Asset Evaluation − It involves cash flow analysis and
prediction, and contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and
spending.
Competition − It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in credit card services and telecommunications to
detect fraud. For fraudulent telephone calls, it helps analyze the destination of the call, the
duration of the call, the time of day or week, etc., and it flags patterns that deviate from
expected norms.
Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable
format. It is an important step in data mining because we cannot work with raw data
directly: the quality of the data should be checked before applying machine learning or data
mining algorithms.
Why is Data preprocessing important?
Preprocessing of data is mainly about checking data quality, which can be assessed along the
following dimensions (a small pandas sketch follows this list):
Accuracy: whether the data entered is correct.
Completeness: whether all required data is available and recorded.
Consistency: whether copies of the same data stored in different places match.
Timeliness: whether the data is kept up to date.
Believability: whether the data is trustworthy.
Interpretability: whether the data is easy to understand.
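The checks below are a minimal sketch, assuming Python with pandas, of how these quality
dimensions might be inspected in practice; the table and column names are hypothetical and used
only for illustration.

import pandas as pd

# Hypothetical customer table used only for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, None, 29, 150],  # None -> completeness issue, 150 -> accuracy issue
    "signup_date": ["2021-01-05", "05/01/2021", "2021-02-10", "2021-03-01"],  # inconsistent formats
})

# Completeness: count missing values per column.
print(df.isnull().sum())

# Consistency: duplicated identifiers suggest the same entity stored twice.
print(df["customer_id"].duplicated().sum())

# Accuracy: flag values outside a plausible range.
print(df[(df["age"] < 0) | (df["age"] > 120)])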
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data
from a dataset, and of replacing missing values. Some common techniques used in
data cleaning are listed below.
Handling missing values:
Standard placeholder values like “Not Available” or “NA” can be used to replace missing
values.
Missing values can also be filled in manually, but this is not recommended when the
dataset is large.
The attribute’s mean value can be used to replace a missing value when the data
is roughly normally distributed,
whereas in the case of a non-normal distribution the attribute’s median is the better choice
(a short imputation sketch follows this list).
With regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
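The following is a minimal sketch of mean and median imputation, assuming Python with pandas;
the income column is a hypothetical example.

import pandas as pd

# Hypothetical dataset with one missing income value, for illustration only.
df = pd.DataFrame({"income": [42000, 38000, None, 51000, 39500]})

# Mean imputation: reasonable when the attribute is roughly normally distributed.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Median imputation: more robust when the distribution is skewed.
df["income_median_filled"] = df["income"].fillna(df["income"].median())

# For categorical attributes, a constant placeholder such as "NA" can be used instead.
print(df)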
(a). Missing Data:
This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:
o Ignore the tuples:
This approach is suitable only when the dataset is quite large
and multiple values are missing within a tuple.
o Fill the missing values:
There are various ways to do this. You can fill the
missing values manually, with the attribute mean, or with the most probable
value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated by faulty data collection, data entry errors, etc. It can
be handled in the following ways:
o Binning Method:
This method works on sorted data in order to smooth it. The whole
dataset is divided into segments of equal size and each segment is handled
separately. All values in a segment can be replaced by the segment's mean, or
boundary values can be used instead (a minimal binning sketch follows this list).
o Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
o Clustering:
This approach groups similar data into clusters; values that fall outside any
cluster can be treated as outliers (noise).
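Below is a minimal sketch of smoothing by bin means, assuming Python with NumPy; the values
and bin size are hypothetical.

import numpy as np

# Hypothetical sorted attribute values, for illustration only.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bin_size = 4
smoothed = values.astype(float)

# Smoothing by bin means: replace every value in a bin with that bin's mean.
for start in range(0, len(values), bin_size):
    segment = values[start:start + bin_size]
    smoothed[start:start + bin_size] = segment.mean()

print(smoothed)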
Benefits of data cleaning
Improved decision making, since data quality otherwise deteriorates at an alarming rate.
Boost results and revenue.
Save money and reduce waste.
Save time and increase productivity.
Protect reputation.
Minimise compliance risks.
Data integration:
Data integration is the process of combining data from multiple sources into a single
dataset. It is one of the main components of data management. Several
problems have to be considered during data integration:
Schema integration: Integrates metadata from different sources.
Entity identification problem: Identifying the same real-world entity across multiple
databases. For example, the system or the user should know that student_id in one database and
student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: Attribute values taken from different
databases may differ when the databases are merged. For example, the date format may differ,
such as “MM/DD/YYYY” versus “DD/MM/YYYY” (see the sketch after this list).
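As a small sketch of resolving such a data value conflict, assuming Python with pandas, the
snippet below normalizes two hypothetical date formats before merging the sources.

import pandas as pd

# Two hypothetical sources storing the same attribute in different date formats.
source_a = pd.DataFrame({"student_id": [1, 2], "enrolled": ["01/31/2022", "02/15/2022"]})  # MM/DD/YYYY
source_b = pd.DataFrame({"student_id": [3, 4], "enrolled": ["31/01/2022", "15/02/2022"]})  # DD/MM/YYYY

# Convert each source to a single date type before combining them.
source_a["enrolled"] = pd.to_datetime(source_a["enrolled"], format="%m/%d/%Y")
source_b["enrolled"] = pd.to_datetime(source_b["enrolled"], format="%d/%m/%Y")

combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)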
Data reduction:
This process reduces the volume of the data, which makes
analysis easier while producing the same (or almost the same) results. The reduction
also saves storage space. Some of the techniques used for data
reduction are dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction: This is necessary for real-world applications because
the data size is large. In this process the number of random variables or attributes
is reduced so that the dimensionality of the dataset decreases; attributes are combined or
merged without losing the original characteristics of the data. This
also reduces storage space and computation time. When
the data is highly dimensional, a problem called the “curse of dimensionality” occurs
(a PCA sketch follows this paragraph).
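Principal component analysis (PCA) is one common dimensionality reduction technique; the text
above does not name a specific method, so the sketch below is only an illustrative example,
assuming Python with NumPy and scikit-learn and synthetic data.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with 10 attributes, 7 of which are redundant combinations.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])

# Reduce the 10 attributes to 3 components while keeping most of the variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())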
Numerosity reduction: In this method, the data is represented in a smaller,
more compact form that reduces its volume while preserving the information needed for analysis.
Data compression: Reducing the data to a compressed form is called data compression.
Compression can be lossless or lossy: when no information is lost during
compression it is called lossless compression, whereas lossy compression discards
some information, ideally only information that is unnecessary.
Data Transformation:
Data transformation is a change made to the format or structure of the data.
This step can be simple or complex depending on the requirements.
Some common data transformation methods are described below.
Smoothing: With the help of algorithms, we remove noise from the dataset,
which helps bring out its important features. Smoothing makes it possible to
detect even small changes that help in prediction.
Aggregation: In this method, the data is stored and presented in the form of a
summary. Data from multiple sources is integrated into a data analysis
description. This is an important step, since the accuracy of the results
depends on the quantity and quality of the data; when both are good,
the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can
set an interval like (3 pm-5 pm, 6 pm-8 pm).
Normalization: This is the method of scaling the data so that it falls within
a smaller range, for example from -1.0 to 1.0 (see the sketch below).
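The sketch below illustrates discretization and min-max normalization, assuming Python with
pandas; the hour-of-day data and interval labels are hypothetical.

import pandas as pd

# Hypothetical class-time data (hour of day), for illustration only.
df = pd.DataFrame({"hour": [9, 11, 15, 16, 18, 19, 20]})

# Discretization: split the continuous hours into labelled intervals.
df["slot"] = pd.cut(df["hour"], bins=[8, 12, 17, 21],
                    labels=["morning", "afternoon", "evening"])

# Normalization: min-max scale the hour into the smaller range [-1.0, 1.0].
h = df["hour"]
df["hour_scaled"] = 2 * (h - h.min()) / (h.max() - h.min()) - 1

print(df)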
Apriori Algorithm
The Apriori algorithm is used to mine association rules between
objects, that is, how two or more objects are related to one another. In other words, the
Apriori algorithm is an association rule learning method that analyzes, for example, whether
people who bought product A also bought product B.
The primary objective of the Apriori algorithm is to create association rules between different objects.
An association rule describes how two or more objects are related to one another. The Apriori algorithm is
also referred to as frequent pattern mining. Generally, you run the Apriori algorithm on a database that
consists of a huge number of transactions, for example the items customers buy at a Big Bazar store.
Methods To Improve Apriori Efficiency
Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and their corresponding counts. It uses a hash function for
building the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in
later iterations. Transactions that do not contain any frequent items are marked or
removed.
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it
should be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from database D and then searches
for frequent itemsets in S. A globally frequent itemset may be missed; this risk
can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked
start point of the database during the scanning of the database.
Steps In Apriori
Apriori algorithm is a sequence of steps to be followed to find the most frequent itemset in
the given database. This data mining technique follows the join and the prune steps
iteratively until the most frequent itemset is achieved. A minimum support threshold is given
in the problem or it is assumed by the user.
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The
algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose
occurrence satisfies min_sup is determined: only those candidates whose count is
greater than or equal to min_sup are carried forward to the next iteration; the others are
pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. In the join step, the
2-itemset candidates are generated by joining the frequent 1-itemsets with themselves,
forming groups of 2.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The table will then
contain only 2-itemsets that meet min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration uses
the antimonotone property: every 2-itemset subset of a candidate 3-itemset must itself
satisfy min_sup. If all 2-itemset subsets are frequent, the superset is kept as a candidate;
otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the 3-itemsets with themselves and pruning any
candidate whose subsets do not meet the min_sup criterion. The algorithm stops when no
new frequent itemsets can be generated (a minimal code sketch of these steps follows).
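The following is a minimal pure-Python sketch of these join and prune steps, using a
hypothetical five-transaction database and min_sup = 2; it is meant only to make the iterations
concrete, not to be an optimized implementation.

from itertools import combinations

# Toy transaction database (hypothetical items), min_sup = 2 as in the steps above.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "jam"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]
min_sup = 2

def support_count(itemset, transactions):
    # Number of transactions containing every item of the candidate itemset.
    return sum(1 for t in transactions if itemset <= t)

# Iteration 1: frequent 1-itemsets.
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_sup}]

k = 2
while frequent[-1]:
    prev = frequent[-1]
    # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (antimonotone property): keep candidates whose (k-1)-subsets are all frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    # Keep only candidates whose support meets the threshold.
    frequent.append({c for c in candidates
                     if support_count(c, transactions) >= min_sup})
    k += 1

for level, itemsets in enumerate(frequent, start=1):
    for s in itemsets:
        print(level, sorted(s), support_count(s, transactions))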
Advantages of Apriori Algorithm
1. The join and prune steps of the algorithm are easy to implement on large datasets.
2. It can be used to find large (frequent) itemsets.
3. Simple to understand.
4. Easy to implement
Disadvantages of Apriori Algorithm
1. The Apriori algorithm is slow compared to other algorithms.
2. Overall performance can suffer because it scans the database multiple times.
3. The time and space complexity of the Apriori algorithm is on the order of O(2^d), where d is
the number of distinct items, which is very high.
4. Apriori is an expensive way to compute support, since every calculation has to pass
through the whole database.
5. Sometimes a huge number of candidate itemsets is generated, which makes it computationally
more expensive.
6. It requires many database scans.
Steps
o Step 1: Start with the transactions in the database.
o Step 2: Calculate the support/frequency of all items.
o Step 3: Discard the items with support less than the minimum (here, 2).
o Step 4: Combine the remaining items into 2-itemsets.
o Step 5: Calculate the support/frequency of these 2-itemsets.
o Step 6: Discard the itemsets with support less than 2.
o Step 7: Combine the remaining items into 3-itemsets, calculate their support, and again
discard those with support less than 2.
o Result: the itemsets that survive are the frequent itemsets.
Components of Apriori algorithm
The Apriori algorithm is built on the following three measures.
1. Support
2. Confidence
3. Lift
Let's take an example to understand these concepts.
As discussed above, you need a huge database containing a
large number of transactions. Suppose you have 4,000 customer transactions at a
Big Bazar store. You want to calculate the Support, Confidence, and Lift for two
products, say Biscuits and Chocolate, because
customers frequently buy these two items together.
Out of the 4,000 transactions, 400 contain Biscuits and 600 contain
Chocolate, and 200 transactions contain both Biscuits
and Chocolate. Using this data, we will find the support, confidence, and
lift.
Support
Support refers to the default popularity of a product. You find the support
by dividing the number of transactions containing that
product by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)
= 400/4000 = 10 percent
Confidence
Confidence refers to the likelihood that customers who bought biscuits
also bought chocolates. So, you need to divide the number of transactions
that contain both biscuits and chocolates by the number of
transactions containing biscuits to get the confidence.
Hence,
Confidence (Biscuits → Chocolate) = (Transactions containing both Biscuits and Chocolate) / (Total
transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits also bought
chocolates.
Lift
Continuing the example above, lift measures how much more likely customers are to buy
chocolates when they buy biscuits, compared with how often chocolates are bought overall.
The formula is
Lift (Biscuits → Chocolate) = Confidence (Biscuits → Chocolate) / Support (Chocolate)
where Support (Chocolate) = 600/4000 = 15 percent, so
Lift = 50 / 15 ≈ 3.33
It means that customers who buy biscuits are about 3.33 times more likely to buy chocolates
than a randomly chosen customer. If the lift value is below one, it indicates that people are
unlikely to buy the two items together; the larger the value, the better the combination.
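As a quick sketch, the calculation above can be reproduced in a few lines of Python using the
numbers from the Big Bazar example (the figures are the ones given in the text).

# Numbers from the Big Bazar example above.
total = 4000
biscuits = 400        # transactions containing Biscuits
chocolate = 600       # transactions containing Chocolate
both = 200            # transactions containing both items

support_biscuits = biscuits / total       # 0.10
support_chocolate = chocolate / total     # 0.15
confidence = both / biscuits              # P(Chocolate | Biscuits) = 0.50
lift = confidence / support_chocolate     # about 3.33

print(f"Support(Biscuits)  = {support_biscuits:.0%}")
print(f"Confidence(B -> C) = {confidence:.0%}")
print(f"Lift(B -> C)       = {lift:.2f}")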
Classification
Classification in data mining is a common technique that separates data points into different
classes. It allows you to organize data sets of all sorts, including complex and large datasets
as well as small and simple ones.
It primarily involves algorithms that you can tune to improve the quality of the results.
This is a big reason why supervised learning is particularly common in classification
techniques in data mining. The primary goal of classification is to connect a variable of
interest (the class label, which should be qualitative) with the explanatory variables.
The algorithm learns the link between the variables in order to make predictions. The algorithm
you use for classification in data mining is called the classifier, and the observations you
classify are called instances. You use classification techniques in data mining
when you have to work with qualitative target variables.
For example, a marketing manager at a company may need to predict whether a customer with a
given profile will buy a new computer.
How Does Classification Work?
The Data Classification process includes two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step.
In this step, the classification algorithm builds the classifier.
The classifier is built from the training set, which is made up of database tuples and their
associated class labels.
Each tuple that constitutes the training set is referred to as a training sample; these tuples
can also be referred to as objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. The test data is used to estimate
the accuracy of the classification rules; if the accuracy is considered acceptable, the rules
can be applied to new data tuples (a minimal sketch of both steps follows).
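The two steps can be sketched with scikit-learn's decision tree classifier on a built-in
dataset; this is only an illustrative example, and the dataset and parameters are stand-ins
rather than part of the original text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: build the classifier from a training set of labelled tuples.
X, y = load_iris(return_X_y=True)   # stand-in dataset; y holds the class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Step 2: use held-out test data to estimate accuracy, then apply the
# classifier to new tuples if the accuracy is acceptable.
y_pred = clf.predict(X_test)
print("estimated accuracy:", accuracy_score(y_test, y_pred))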
Advantages of Classification with Decision Trees:
1. Inexpensive to construct.
2. Extremely fast at classifying unknown records.
3. Easy to interpret for small-sized trees
4. Accuracy comparable to other classification techniques for many simple data sets.
5. Excludes unimportant features.
Disadvantages of Classification with Decision Trees:
1. Easy to overfit.
2. Decision Boundary restricted to being parallel to attribute axes.
3. Decision tree models are often biased toward splits on features having a large
number of levels.
4. Small changes in the training data can result in large changes to decision
logic.
5. Large trees can be difficult to interpret, and the decisions they make may
seem counterintuitive.
Applications of Decision trees in real life :
1. Biomedical Engineering (decision trees for identifying features to be used in implantable
devices).
2. Financial analysis (Customer Satisfaction with a product or service).
3. Astronomy (classify galaxies).
4. System Control.
5. Manufacturing and Production (Quality control, Semiconductor manufacturing, etc).
6. Medicine (diagnosis, cardiology, psychiatry).
7. Physics (Particle detection).
Frequent Pattern Growth Algorithm
This algorithm is an improvement over the Apriori method: frequent patterns are generated without
the need for candidate generation. The FP-growth algorithm represents the database in the form of
a tree called a frequent pattern tree, or FP tree.
This tree structure maintains the associations between the itemsets. The database is fragmented
using one frequent item, and each fragmented part is called a “pattern fragment”. The itemsets of
these pattern fragments are then analyzed. With this method, the search for frequent itemsets is
considerably reduced.
FP Tree
The frequent pattern tree is a tree-like structure built from the initial itemsets of the
database. The purpose of the FP tree is to mine the most frequent patterns. Each node of the FP
tree represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations of
the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained
while forming the tree.
Advantages of FP growth algorithm:-
1. Faster than the Apriori algorithm
2. No candidate generation
3. Only two passes over dataset
Disadvantages of FP growth algorithm:-
1. FP tree may not fit in memory
2. FP tree is expensive to build
Frequent Pattern Algorithm Steps
The frequent pattern growth method lets us find the frequent pattern without candidate generation.
Let us see the steps followed to mine the frequent pattern using frequent pattern growth algorithm:
#1) The first step is to scan the database to find the occurrences of the itemsets in the database. This
step is the same as the first step of Apriori. The count of 1-itemsets in the database is called support
count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find the itemsets in it. The itemset with the maximum count is placed at the top,
followed by the itemset with the next lower count, and so on; that is, the branch of the tree is
constructed with the transaction's itemsets in descending order of count.
#4) The next transaction in the database is examined, and its itemsets are again ordered in
descending order of count. If any itemset of this transaction is already present in another branch
(for example from the first transaction), then this transaction's branch shares a common prefix
starting at the root.
This means that the common itemset is linked to the new node of the other itemset in this transaction.
#5) The count of each itemset is incremented as it occurs in the transactions: the counts of both
the common nodes and the new nodes are increased by 1 as they are created and linked according to
the transactions.
#6) The next step is to mine the created FP tree. For this, the lowest nodes are examined first,
along with the links of the lowest nodes. Each lowest node represents a frequent pattern of length
1. From it, traverse the paths in the FP tree; these paths are called the conditional pattern base.
The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that
occur with the lowest node (the suffix).
#7) Construct a conditional FP tree, formed from the counts of the itemsets along those paths.
Only the itemsets meeting the threshold support are kept in the conditional FP tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
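As a usage sketch (assuming the third-party mlxtend library is installed), the same toy
transactions used in the Apriori sketch can be mined with an off-the-shelf FP-growth
implementation:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# The same toy transactions as in the Apriori sketch above.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "jam"],
    ["milk", "butter"],
    ["bread", "milk", "butter"],
    ["bread", "milk", "butter", "jam"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Mine frequent itemsets without candidate generation (min support 2/5 = 0.4).
print(fpgrowth(df, min_support=0.4, use_colnames=True))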
Introduction to K nearest neighbour
K Nearest Neighbour (KNN) is a simple algorithm that stores all the available cases and
classifies a new case based on a similarity measure. It is mostly used to classify a data point
based on how its neighbours are classified.
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can
be used for both classification as well as regression predictive problems. However, it is
mainly used for classification predictive problems in industry. The following two
properties would define KNN well −
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not
have a specialized training phase; instead, it uses all of the training data at
classification time.
Non-parametric learning algorithm − KNN is also a non-parametric learning
algorithm because it doesn’t assume anything about the underlying data.
Working of KNN Algorithm
The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of
new data points, which means that a new data point is assigned a value
based on how closely it matches the points in the training set. We can understand how it
works with the help of the following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of
KNN, we must load the training data as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points
to consider. K can be any integer.
Step 3 − For each point in the test data do the following −
3.1 − Calculate the distance between the test point and each row of the training data
using a distance metric such as Euclidean, Manhattan, or Hamming
distance. The most commonly used metric is Euclidean distance.
3.2 − Now, based on the distance values, sort them in ascending order.
3.3 − Next, choose the top K rows from the sorted array.
3.4 − Now, assign the test point the most frequent class among
these rows (a minimal pure-Python sketch of these steps follows Step 4).
Step 4 − End
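The steps above can be sketched directly in a few lines of pure Python; the training points,
test point, and value of K below are hypothetical.

import math
from collections import Counter

# Hypothetical labelled training points: (feature1, feature2, class label).
training = [(1.0, 1.1, "A"), (1.2, 0.9, "A"), (3.0, 3.2, "B"), (3.1, 2.9, "B"), (0.9, 1.0, "A")]
test_point = (2.8, 3.0)
k = 3  # Step 2: choose the value of K

# Step 3.1: Euclidean distance from the test point to every training row.
distances = [(math.dist(test_point, (x, y)), label) for x, y, label in training]

# Steps 3.2 and 3.3: sort in ascending order and keep the top K rows.
nearest = sorted(distances)[:k]

# Step 3.4: assign the most frequent class among these K neighbours.
predicted = Counter(label for _, label in nearest).most_common(1)[0][0]
print(predicted)  # "B" for this toy data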
Pros and Cons of KNN
Pros
1. Simple to implement
2. Flexible to feature/distance choices
3. Naturally handles multi-class cases
4. Can do well in practice with enough representative data
5. It is a very simple algorithm to understand and interpret.
6. It is very useful for nonlinear data because there is no assumption about data in this algorithm.
7. It is a versatile algorithm as we can use it for classification as well as regression.
8. It has relatively high accuracy, although there are better supervised learning models than KNN.
Cons
1. Need to determine the value of the parameter K (the number of nearest neighbors).
2. Computation cost is quite high because we need to compute the distance from each query instance
to all training samples.
3. It must store all of the training data.
4. It requires a meaningful distance function.
5. It is a computationally somewhat expensive algorithm because it stores all the training data.
6. High memory storage required as compared to other supervised learning algorithms.
7. Prediction is slow in case of big N.
8. It is very sensitive to the scale of data as well as irrelevant features.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, i.e. whether that individual has characteristics similar to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by comparing them with
people having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes
like “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’”, or “Will Vote for Party ‘BJP’”.
Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
What is Naive Bayes algorithm?
It is a classification technique based on Bayes’ Theorem with an assumption of independence among
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature.
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the
occurrence of other features.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Advantages of Naive Bayes Classifier
The following are some of the benefits of the Naive Bayes classifier:
It is simple and easy to implement
It doesn’t require as much training data
It handles both continuous and discrete data
It is highly scalable with the number of predictors and data points
It is fast and can be used to make real-time predictions
It is not sensitive to irrelevant features
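A minimal usage sketch with scikit-learn's Gaussian Naive Bayes classifier is shown below; the
built-in dataset and split parameters are illustrative choices, not part of the original text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Each feature is treated as conditionally independent given the class,
# which is the "naive" assumption described above.
model = GaussianNB()
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))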