What is Data Mining?
Data mining is the process of uncovering trends, common themes, or patterns in “big data”.
Knowledge from data mining can help companies and governments cut costs or increase
revenue. For example, an early form of data mining was used by companies to analyze huge
amounts of scanner data from supermarkets. This analysis revealed when people were most
likely to shop, and when they were most likely to buy certain products, like wine or baby
products. This enabled the retailer to maximize revenue by ensuring they always had
enough product at the right time in the right place.
As well as gathering data on “What people watch, listen to, or buy”, modern mining
techniques are used to find answers to a wide variety of questions such as:
Which transactions are more likely to be fraudulent?
Who is a “typical” customer?
What mammograms should be flagged as “abnormal”?
Although data mining is most commonly used by large businesses and organizations, any kind of
data, from any type of database, can be mined.
Data mining is defined as the procedure of extracting information from huge sets of data; in
other words, it is the process of mining knowledge from data. The information or knowledge
extracted in this way can be used for any of the following applications –
1. Market Analysis
2. Fraud Detection
3. Customer Retention
4. Production Control
5. Science Exploration
Who uses Data Mining?
Data mining is primarily used by industries that cater to consumers, such as retail,
financial, and marketing companies. If you’ve ever shopped at a retail store and received
customized coupons, that’s a result of mining. Your individual purchase history was analyzed
to find out what products you’ve been buying and what promotions you’re likely to be
interested in. Netflix uses data mining to recommend movies to its customers, Google uses
data mining to tailor advertisements to internet users, and Walmart uses data mining to manage
inventory and identify areas where new products are likely to be successful. Data mining is more
likely to be used by larger companies, since substantial computing resources are needed to sift
through the data.
Data Mining Applications
Data mining is highly useful in the following domains −
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer
retention, science exploration, sports, astronomy, and Web usage mining.
Market Analysis and Management
Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of people buy what
kind of products.
Identifying Customer Requirements − Data mining helps identify the best
products for different customers and uses prediction to find the factors that may
attract new customers.
Cross-Market Analysis − Data mining finds associations and correlations between
the sales of different products.
Target Marketing − Data mining helps find clusters of model customers who
share the same characteristics, such as interests, spending habits, and income.
Determining Customer Purchasing Patterns − Data mining helps determine
customers' purchasing patterns.
Providing Summary Information − Data mining provides various
multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector −
Finance Planning and Asset Evaluation − It involves cash flow analysis and
prediction, and contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and
spending.
Competition − It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in credit card services and telecommunications to
detect fraud. For fraudulent telephone calls, it helps analyze the destination of the call, the
duration of the call, the time of day or week, etc., and it flags patterns that deviate from
expected norms.
Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable
format. It is an important step in data mining because we cannot work with raw data
directly: the quality of the data should be checked before applying machine learning or data
mining algorithms.
Why is Data preprocessing important?
Preprocessing of data is mainly about checking data quality, which can be assessed along the
following dimensions (a small pandas sketch follows this list):
Accuracy: whether the data entered is correct.
Completeness: whether all required data is available and recorded.
Consistency: whether copies of the same data stored in different places match.
Timeliness: whether the data is kept up to date.
Believability: whether the data is trustworthy.
Interpretability: whether the data is easy to understand.
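The checks below are a minimal sketch, assuming Python with pandas, of how these quality
dimensions might be inspected in practice; the table and column names are hypothetical and used
only for illustration.

import pandas as pd

# Hypothetical customer table used only for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, None, 29, 150],  # None -> completeness issue, 150 -> accuracy issue
    "signup_date": ["2021-01-05", "05/01/2021", "2021-02-10", "2021-03-01"],  # inconsistent formats
})

# Completeness: count missing values per column.
print(df.isnull().sum())

# Consistency: duplicated identifiers suggest the same entity stored twice.
print(df["customer_id"].duplicated().sum())

# Accuracy: flag values outside a plausible range.
print(df[(df["age"] < 0) | (df["age"] > 120)])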
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data
from a dataset, and of replacing missing values. Some common techniques used in
data cleaning are listed below.
Handling missing values:
Standard placeholder values like “Not Available” or “NA” can be used to replace missing
values.
Missing values can also be filled in manually, but this is not recommended when the
dataset is large.
The attribute’s mean value can be used to replace a missing value when the data
is roughly normally distributed,
whereas in the case of a non-normal distribution the attribute’s median is the better choice
(a short imputation sketch follows this list).
With regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
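The following is a minimal sketch of mean and median imputation, assuming Python with pandas;
the income column is a hypothetical example.

import pandas as pd

# Hypothetical dataset with one missing income value, for illustration only.
df = pd.DataFrame({"income": [42000, 38000, None, 51000, 39500]})

# Mean imputation: reasonable when the attribute is roughly normally distributed.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Median imputation: more robust when the distribution is skewed.
df["income_median_filled"] = df["income"].fillna(df["income"].median())

# For categorical attributes, a constant placeholder such as "NA" can be used instead.
print(df)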
(a). Missing Data:
This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:
o Ignore the tuples:
This approach is suitable only when the dataset is quite large
and multiple values are missing within a tuple.
o Fill the missing values:
There are various ways to do this. You can fill the
missing values manually, with the attribute mean, or with the most probable
value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated by faulty data collection, data entry errors, etc. It can
be handled in the following ways:
o Binning Method:
This method works on sorted data in order to smooth it. The whole
dataset is divided into segments of equal size and each segment is handled
separately. All values in a segment can be replaced by the segment's mean, or
boundary values can be used instead (a minimal binning sketch follows this list).
o Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
o Clustering:
This approach groups similar data into clusters; values that fall outside any
cluster can be treated as outliers (noise).
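Below is a minimal sketch of smoothing by bin means, assuming Python with NumPy; the values
and bin size are hypothetical.

import numpy as np

# Hypothetical sorted attribute values, for illustration only.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bin_size = 4
smoothed = values.astype(float)

# Smoothing by bin means: replace every value in a bin with that bin's mean.
for start in range(0, len(values), bin_size):
    segment = values[start:start + bin_size]
    smoothed[start:start + bin_size] = segment.mean()

print(smoothed)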
Benefits of data cleaning
Improved decision making, since data quality otherwise deteriorates at an alarming rate.
Boost results and revenue.
Save money and reduce waste.
Save time and increase productivity.
Protect reputation.
Minimise compliance risks.
Data integration:
Data integration is the process of combining data from multiple sources into a single
dataset. It is one of the main components of data management. Several
problems have to be considered during data integration:
Schema integration: Integrates metadata from different sources.
Entity identification problem: Identifying the same real-world entity across multiple
databases. For example, the system or the user should know that student_id in one database and
student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: Attribute values taken from different
databases may differ when the databases are merged. For example, the date format may differ,
such as “MM/DD/YYYY” versus “DD/MM/YYYY” (see the sketch after this list).
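As a small sketch of resolving such a data value conflict, assuming Python with pandas, the
snippet below normalizes two hypothetical date formats before merging the sources.

import pandas as pd

# Two hypothetical sources storing the same attribute in different date formats.
source_a = pd.DataFrame({"student_id": [1, 2], "enrolled": ["01/31/2022", "02/15/2022"]})  # MM/DD/YYYY
source_b = pd.DataFrame({"student_id": [3, 4], "enrolled": ["31/01/2022", "15/02/2022"]})  # DD/MM/YYYY

# Convert each source to a single date type before combining them.
source_a["enrolled"] = pd.to_datetime(source_a["enrolled"], format="%m/%d/%Y")
source_b["enrolled"] = pd.to_datetime(source_b["enrolled"], format="%d/%m/%Y")

combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)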
Data reduction:
This process reduces the volume of the data, which makes
analysis easier while producing the same (or almost the same) results. The reduction
also saves storage space. Some of the techniques used for data
reduction are dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction: This is necessary for real-world applications because
the data size is large. In this process the number of random variables or attributes
is reduced so that the dimensionality of the dataset decreases; attributes are combined or
merged without losing the original characteristics of the data. This
also reduces storage space and computation time. When
the data is highly dimensional, a problem called the “curse of dimensionality” occurs
(a PCA sketch follows this paragraph).
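Principal component analysis (PCA) is one common dimensionality reduction technique; the text
above does not name a specific method, so the sketch below is only an illustrative example,
assuming Python with NumPy and scikit-learn and synthetic data.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with 10 attributes, 7 of which are redundant combinations.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])

# Reduce the 10 attributes to 3 components while keeping most of the variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())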
Numerosity reduction: In this method, the data is represented in a smaller,
more compact form that reduces its volume while preserving the information needed for analysis.
Data compression: Reducing the data to a compressed form is called data compression.
Compression can be lossless or lossy: when no information is lost during
compression it is called lossless compression, whereas lossy compression discards
some information, ideally only information that is unnecessary.
Data Transformation:
Data transformation is a change made to the format or structure of the data.
This step can be simple or complex depending on the requirements.
Some common data transformation methods are described below.
Smoothing: With the help of algorithms, we remove noise from the dataset,
which helps bring out its important features. Smoothing makes it possible to
detect even small changes that help in prediction.
Aggregation: In this method, the data is stored and presented in the form of a
summary. Data from multiple sources is integrated into a data analysis
description. This is an important step, since the accuracy of the results
depends on the quantity and quality of the data; when both are good,
the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can
set an interval like (3 pm-5 pm, 6 pm-8 pm).
Normalization: This is the method of scaling the data so that it falls within
a smaller range, for example from -1.0 to 1.0 (see the sketch below).
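The sketch below illustrates discretization and min-max normalization, assuming Python with
pandas; the hour-of-day data and interval labels are hypothetical.

import pandas as pd

# Hypothetical class-time data (hour of day), for illustration only.
df = pd.DataFrame({"hour": [9, 11, 15, 16, 18, 19, 20]})

# Discretization: split the continuous hours into labelled intervals.
df["slot"] = pd.cut(df["hour"], bins=[8, 12, 17, 21],
                    labels=["morning", "afternoon", "evening"])

# Normalization: min-max scale the hour into the smaller range [-1.0, 1.0].
h = df["hour"]
df["hour_scaled"] = 2 * (h - h.min()) / (h.max() - h.min()) - 1

print(df)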
Apriori Algorithm
The Apriori algorithm is used to mine association rules between
objects, that is, how two or more objects are related to one another. In other words, the
Apriori algorithm is an association rule learning method that analyzes, for example, whether
people who bought product A also bought product B.
The primary objective of the Apriori algorithm is to create association rules between different objects.
An association rule describes how two or more objects are related to one another. The Apriori algorithm is
also referred to as frequent pattern mining. Generally, you run the Apriori algorithm on a database that
consists of a huge number of transactions, for example the items customers buy at a Big Bazar store.
Methods To Improve Apriori Efficiency
Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and their corresponding counts. It uses a hash function for
building the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in
later iterations. Transactions that do not contain any frequent items are marked or
removed.
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it
should be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from database D and then searches
for frequent itemsets in S. A globally frequent itemset may be missed; this risk
can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked
start point of the database during the scanning of the database.
Steps In Apriori
Apriori algorithm is a sequence of steps to be followed to find the most frequent itemset in
the given database. This data mining technique follows the join and the prune steps
iteratively until the most frequent itemset is achieved. A minimum support threshold is given
in the problem or it is assumed by the user.
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The
algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose
occurrence satisfies min_sup is determined: only those candidates whose count is
greater than or equal to min_sup are carried forward to the next iteration; the others are
pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. In the join step, the
2-itemset candidates are generated by joining the frequent 1-itemsets with themselves,
forming groups of 2.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The table will then
contain only 2-itemsets that meet min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration uses
the antimonotone property: every 2-itemset subset of a candidate 3-itemset must itself
satisfy min_sup. If all 2-itemset subsets are frequent, the superset is kept as a candidate;
otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the 3-itemsets with themselves and pruning any
candidate whose subsets do not meet the min_sup criterion. The algorithm stops when no
new frequent itemsets can be generated (a minimal code sketch of these steps follows).
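The following is a minimal pure-Python sketch of these join and prune steps, using a
hypothetical five-transaction database and min_sup = 2; it is meant only to make the iterations
concrete, not to be an optimized implementation.

from itertools import combinations

# Toy transaction database (hypothetical items), min_sup = 2 as in the steps above.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "jam"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]
min_sup = 2

def support_count(itemset, transactions):
    # Number of transactions containing every item of the candidate itemset.
    return sum(1 for t in transactions if itemset <= t)

# Iteration 1: frequent 1-itemsets.
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_sup}]

k = 2
while frequent[-1]:
    prev = frequent[-1]
    # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (antimonotone property): keep candidates whose (k-1)-subsets are all frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    # Keep only candidates whose support meets the threshold.
    frequent.append({c for c in candidates
                     if support_count(c, transactions) >= min_sup})
    k += 1

for level, itemsets in enumerate(frequent, start=1):
    for s in itemsets:
        print(level, sorted(s), support_count(s, transactions))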
Advantages of Apriori Algorithm
1. The join and prune steps of the algorithm are easy to implement on large datasets.
2. It can be used to find large (frequent) itemsets.
3. Simple to understand.
4. Easy to implement
Disadvantages of Apriori Algorithm
1. The Apriori algorithm is slow compared to other algorithms.
2. Overall performance can suffer because it scans the database multiple times.
3. The time and space complexity of the Apriori algorithm is on the order of O(2^d), where d is
the number of distinct items, which is very high.
4. Apriori is an expensive way to compute support, since every calculation has to pass
through the whole database.
5. Sometimes a huge number of candidate itemsets is generated, which makes it computationally
more expensive.
6. It requires many database scans.
Steps
o Step 1: Start with the transactions in the database.
o Step 2: Calculate the support/frequency of all items.
o Step 3: Discard the items with support less than the minimum (here, 2).
o Step 4: Combine the remaining items into 2-itemsets.
o Step 5: Calculate the support/frequency of these 2-itemsets.
o Step 6: Discard the itemsets with support less than 2.
o Step 7: Combine the remaining items into 3-itemsets, calculate their support, and again
discard those with support less than 2.
o Result: the itemsets that survive are the frequent itemsets.
Components of Apriori algorithm
The Apriori algorithm is built on the following three measures.
1. Support
2. Confidence
3. Lift
Let's take an example to understand these concepts.
As discussed above, you need a huge database containing a
large number of transactions. Suppose you have 4,000 customer transactions at a
Big Bazar store. You want to calculate the Support, Confidence, and Lift for two
products, say Biscuits and Chocolate, because
customers frequently buy these two items together.
Out of the 4,000 transactions, 400 contain Biscuits and 600 contain
Chocolate, and 200 transactions contain both Biscuits
and Chocolate. Using this data, we will find the support, confidence, and
lift.
Support
Support refers to the default popularity of a product. You find the support
by dividing the number of transactions containing that
product by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)
= 400/4000 = 10 percent
Confidence
Confidence refers to the likelihood that customers who bought biscuits
also bought chocolates. So, you need to divide the number of transactions
that contain both biscuits and chocolates by the number of
transactions containing biscuits to get the confidence.
Hence,
Confidence (Biscuits → Chocolate) = (Transactions containing both Biscuits and Chocolate) / (Total
transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits also bought
chocolates.
Lift
Continuing the example above, lift measures how much more likely customers are to buy
chocolates when they buy biscuits, compared with how often chocolates are bought overall.
The formula is
Lift (Biscuits → Chocolate) = Confidence (Biscuits → Chocolate) / Support (Chocolate)
where Support (Chocolate) = 600/4000 = 15 percent, so
Lift = 50 / 15 ≈ 3.33
It means that customers who buy biscuits are about 3.33 times more likely to buy chocolates
than a randomly chosen customer. If the lift value is below one, it indicates that people are
unlikely to buy the two items together; the larger the value, the better the combination.
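As a quick sketch, the calculation above can be reproduced in a few lines of Python using the
numbers from the Big Bazar example (the figures are the ones given in the text).

# Numbers from the Big Bazar example above.
total = 4000
biscuits = 400        # transactions containing Biscuits
chocolate = 600       # transactions containing Chocolate
both = 200            # transactions containing both items

support_biscuits = biscuits / total       # 0.10
support_chocolate = chocolate / total     # 0.15
confidence = both / biscuits              # P(Chocolate | Biscuits) = 0.50
lift = confidence / support_chocolate     # about 3.33

print(f"Support(Biscuits)  = {support_biscuits:.0%}")
print(f"Confidence(B -> C) = {confidence:.0%}")
print(f"Lift(B -> C)       = {lift:.2f}")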
Classification
Classification in data mining is a common technique that separates data points into different
classes. It allows you to organize data sets of all sorts, including complex and large datasets
as well as small and simple ones.
It primarily involves algorithms that you can tune to improve the quality of the results.
This is a big reason why supervised learning is particularly common in classification
techniques in data mining. The primary goal of classification is to connect a variable of
interest (the class label, which should be qualitative) with the explanatory variables.
The algorithm learns the link between the variables in order to make predictions. The algorithm
you use for classification in data mining is called the classifier, and the observations you
classify are called instances. You use classification techniques in data mining
when you have to work with qualitative target variables.
For example, a marketing manager at a company may need to predict whether a customer with a
given profile will buy a new computer.
How Does Classification Work?
The Data Classification process includes two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step.
In this step, the classification algorithm builds the classifier.
The classifier is built from the training set, which is made up of database tuples and their
associated class labels.
Each tuple that constitutes the training set is referred to as a training sample; these tuples
can also be referred to as objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. The test data is used to estimate
the accuracy of the classification rules; if the accuracy is considered acceptable, the rules
can be applied to new data tuples (a minimal sketch of both steps follows).
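The two steps can be sketched with scikit-learn's decision tree classifier on a built-in
dataset; this is only an illustrative example, and the dataset and parameters are stand-ins
rather than part of the original text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: build the classifier from a training set of labelled tuples.
X, y = load_iris(return_X_y=True)   # stand-in dataset; y holds the class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Step 2: use held-out test data to estimate accuracy, then apply the
# classifier to new tuples if the accuracy is acceptable.
y_pred = clf.predict(X_test)
print("estimated accuracy:", accuracy_score(y_test, y_pred))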
Advantages of Classification with Decision Trees:
1. Inexpensive to construct.
2. Extremely fast at classifying unknown records.
3. Easy to interpret for small-sized trees
4. Accuracy comparable to other classification techniques for many simple data sets.
5. Excludes unimportant features.
Disadvantages of Classification with Decision Trees:
1. Easy to overfit.
2. Decision Boundary restricted to being parallel to attribute axes.
3. Decision tree models are often biased toward splits on features having a large
number of levels.
4. Small changes in the training data can result in large changes to decision
logic.
5. Large trees can be difficult to interpret, and the decisions they make may
seem counterintuitive.
Applications of Decision trees in real life :
1. Biomedical Engineering (decision trees for identifying features to be used in implantable
devices).
2. Financial analysis (Customer Satisfaction with a product or service).
3. Astronomy (classify galaxies).
4. System Control.
5. Manufacturing and Production (Quality control, Semiconductor manufacturing, etc).
6. Medicine (diagnosis, cardiology, psychiatry).
7. Physics (Particle detection).
Frequent Pattern Growth Algorithm
This algorithm is an improvement over the Apriori method: frequent patterns are generated without
the need for candidate generation. The FP-growth algorithm represents the database in the form of
a tree called a frequent pattern tree, or FP tree.
This tree structure maintains the associations between the itemsets. The database is fragmented
using one frequent item, and each fragmented part is called a “pattern fragment”. The itemsets of
these pattern fragments are then analyzed. With this method, the search for frequent itemsets is
considerably reduced.
FP Tree
The frequent pattern tree is a tree-like structure built from the initial itemsets of the
database. The purpose of the FP tree is to mine the most frequent patterns. Each node of the FP
tree represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations of
the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained
while forming the tree.
Advantages of FP growth algorithm:-
1. Faster than the Apriori algorithm
2. No candidate generation
3. Only two passes over dataset
Disadvantages of FP growth algorithm:-
1. FP tree may not fit in memory
2. FP tree is expensive to build
Frequent Pattern Algorithm Steps
The frequent pattern growth method lets us find the frequent pattern without candidate generation.
Let us see the steps followed to mine the frequent pattern using frequent pattern growth algorithm:
#1) The first step is to scan the database to find the occurrences of the itemsets in the database. This
step is the same as the first step of Apriori. The count of 1-itemsets in the database is called support
count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find the itemsets in it. The itemset with the maximum count is placed at the top,
followed by the itemset with the next lower count, and so on; that is, the branch of the tree is
constructed with the transaction's itemsets in descending order of count.
#4) The next transaction in the database is examined, and its itemsets are again ordered in
descending order of count. If any itemset of this transaction is already present in another branch
(for example from the first transaction), then this transaction's branch shares a common prefix
starting at the root.
This means that the common itemset is linked to the new node of the other itemset in this transaction.
#5) The count of each itemset is incremented as it occurs in the transactions: the counts of both
the common nodes and the new nodes are increased by 1 as they are created and linked according to
the transactions.
#6) The next step is to mine the created FP tree. For this, the lowest nodes are examined first,
along with the links of the lowest nodes. Each lowest node represents a frequent pattern of length
1. From it, traverse the paths in the FP tree; these paths are called the conditional pattern base.
The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that
occur with the lowest node (the suffix).
#7) Construct a conditional FP tree, formed from the counts of the itemsets along those paths.
Only the itemsets meeting the threshold support are kept in the conditional FP tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
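As a usage sketch (assuming the third-party mlxtend library is installed), the same toy
transactions used in the Apriori sketch can be mined with an off-the-shelf FP-growth
implementation:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# The same toy transactions as in the Apriori sketch above.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "jam"],
    ["milk", "butter"],
    ["bread", "milk", "butter"],
    ["bread", "milk", "butter", "jam"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Mine frequent itemsets without candidate generation (min support 2/5 = 0.4).
print(fpgrowth(df, min_support=0.4, use_colnames=True))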
Introduction to K nearest neighbour
K Nearest Neighbour (KNN) is a simple algorithm that stores all the available cases and
classifies a new case based on a similarity measure. It is mostly used to classify a data point
based on how its neighbours are classified.
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can
be used for both classification as well as regression predictive problems. However, it is
mainly used for classification predictive problems in industry. The following two
properties would define KNN well −
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not
have a specialized training phase; instead, it uses all of the training data at
classification time.
Non-parametric learning algorithm − KNN is also a non-parametric learning
algorithm because it doesn’t assume anything about the underlying data.
Working of KNN Algorithm
The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of
new data points, which means that a new data point is assigned a value
based on how closely it matches the points in the training set. We can understand how it
works with the help of the following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of
KNN, we must load the training data as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points
to consider. K can be any integer.
Step 3 − For each point in the test data do the following −
3.1 − Calculate the distance between the test point and each row of the training data
using a distance metric such as Euclidean, Manhattan, or Hamming
distance. The most commonly used metric is Euclidean distance.
3.2 − Now, based on the distance values, sort them in ascending order.
3.3 − Next, choose the top K rows from the sorted array.
3.4 − Now, assign the test point the most frequent class among
these rows (a minimal pure-Python sketch of these steps follows Step 4).
Step 4 − End
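The steps above can be sketched directly in a few lines of pure Python; the training points,
test point, and value of K below are hypothetical.

import math
from collections import Counter

# Hypothetical labelled training points: (feature1, feature2, class label).
training = [(1.0, 1.1, "A"), (1.2, 0.9, "A"), (3.0, 3.2, "B"), (3.1, 2.9, "B"), (0.9, 1.0, "A")]
test_point = (2.8, 3.0)
k = 3  # Step 2: choose the value of K

# Step 3.1: Euclidean distance from the test point to every training row.
distances = [(math.dist(test_point, (x, y)), label) for x, y, label in training]

# Steps 3.2 and 3.3: sort in ascending order and keep the top K rows.
nearest = sorted(distances)[:k]

# Step 3.4: assign the most frequent class among these K neighbours.
predicted = Counter(label for _, label in nearest).most_common(1)[0][0]
print(predicted)  # "B" for this toy data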
Pros and Cons of KNN
Pros
1. Simple to implement
2. Flexible to feature/distance choices
3. Naturally handles multi-class cases
4. Can do well in practice with enough representative data
5. It is a very simple algorithm to understand and interpret.
6. It is very useful for nonlinear data because there is no assumption about data in this algorithm.
7. It is a versatile algorithm as we can use it for classification as well as regression.
8. It has relatively high accuracy, although there are better supervised learning models than KNN.
Cons
1. Need to determine the value of the parameter K (the number of nearest neighbors).
2. Computation cost is quite high because we need to compute the distance from each query instance
to all training samples.
3. It must store all of the training data.
4. It requires a meaningful distance function.
5. It is a computationally somewhat expensive algorithm because it stores all the training data.
6. High memory storage required as compared to other supervised learning algorithms.
7. Prediction is slow in case of big N.
8. It is very sensitive to the scale of data as well as irrelevant features.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, i.e. whether that individual has characteristics similar to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by comparing them with
people having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes
like “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’”, or “Will Vote for Party ‘BJP’”.
Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
What is Naive Bayes algorithm?
It is a classification technique based on Bayes’ Theorem with an assumption of independence among
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature.
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the
occurrence of other features.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Advantages of Naive Bayes Classifier
The following are some of the benefits of the Naive Bayes classifier:
It is simple and easy to implement
It doesn’t require as much training data
It handles both continuous and discrete data
It is highly scalable with the number of predictors and data points
It is fast and can be used to make real-time predictions
It is not sensitive to irrelevant features
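A minimal usage sketch with scikit-learn's Gaussian Naive Bayes classifier is shown below; the
built-in dataset and split parameters are illustrative choices, not part of the original text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Each feature is treated as conditionally independent given the class,
# which is the "naive" assumption described above.
model = GaussianNB()
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))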