Past PPR
2. Clustering:
Clustering is the division of data into groups of connected (related) objects. Describing the data by a few clusters inevitably loses some fine detail, but achieves simplification: the data are modeled by their clusters. From a historical point of view, clustering as a form of data modeling is rooted in statistics, mathematics, and numerical analysis.
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likely value of a specific (dependent) variable. Regression is primarily a form of planning and modeling.
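To make this concrete, here is a minimal sketch of fitting a regression model, assuming scikit-learn is available; the spend/sales numbers are made up purely for illustration.

```python
# Minimal linear regression sketch on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical advertising spend (X) and observed sales (y).
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([25, 45, 65, 85, 105])

model = LinearRegression().fit(X, y)

# Use the fitted relationship to estimate sales for a new spend value.
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("Predicted sales for spend = 60:", model.predict([[60]])[0])
```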
4. Association Rules:
This data mining technique helps discover links between two or more items and finds hidden patterns in the data set. Association rules are if-then statements that show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to uncover sales correlations in transactional data or in medical data sets.
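As a rough sketch of the underlying idea (the transactions below are invented, not from the text), the support and confidence of an if-then rule such as {bread} -> {butter} can be computed directly:

```python
# Support/confidence sketch for the rule {bread} -> {butter} on hypothetical transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"butter"}

n = len(transactions)
support_both = sum(1 for t in transactions if antecedent <= t and consequent <= t) / n
support_antecedent = sum(1 for t in transactions if antecedent <= t) / n

print("support({bread, butter}) =", support_both)                          # 2/4 = 0.5
print("confidence(bread -> butter) =", support_both / support_antecedent)  # ~0.67
```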
5. Outlier Detection:
This data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. It may be used in various domains such as intrusion detection and fraud detection, and is also known as outlier analysis or outlier mining.
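A very simple outlier check, sketched here on hypothetical transaction amounts, flags values that lie far from the mean in standard-deviation terms:

```python
# Simple z-score outlier check on hypothetical transaction amounts.
import statistics

amounts = [12.0, 15.5, 14.2, 13.8, 500.0, 15.1, 12.9]

mean = statistics.mean(amounts)
stdev = statistics.pstdev(amounts)

# Flag values more than 2 standard deviations from the mean.
outliers = [x for x in amounts if abs(x - mean) / stdev > 2]
print("Outliers:", outliers)  # the 500.0 transaction stands out
```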
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data in order to discover sequential patterns. It involves finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of criteria such as length and occurrence frequency.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
Why should we use data warehousing and how can you extract data for
analysis?
Why Use Data Warehousing:
Data warehousing offers several advantages for organizations seeking to leverage their data effectively. It provides a centralized repository for data, integrating information from various sources while ensuring consistency and accuracy. Data warehousing stores historical data, enabling long-term trend analysis, and improves data quality through cleansing and transformation processes. With optimized query performance, it also supports fast, complex analytical queries, making it straightforward to extract data for reporting and analysis.
Explain the use of data mining queries or why data mining queries are
more helpful?
Data mining queries are used to extract patterns and insights from large datasets. They are more helpful
than traditional database queries because they can be used to discover hidden relationships and
patterns in the data that would be difficult or impossible to find using traditional methods.
Specifically, data mining queries can be used to:
• Find hidden relationships and patterns in the data. Traditional database queries can only be used
to find data that has already been explicitly defined. Data mining queries, on the other hand, can
be used to discover new relationships and patterns in the data that were not previously known.
• Analyze large datasets more efficiently. Traditional database queries can be slow and inefficient
when used to analyze large datasets. Data mining queries, on the other hand, are designed to be
efficient and scalable.
• Provide more insights into the data. Traditional database queries can only be used to retrieve
raw data. Data mining queries, on the other hand, can be used to extract insights from the data,
such as patterns, trends, and predictions.
Data Selection
The first step in the KDD process is identifying and selecting the relevant data for analysis. This involves
choosing the relevant data sources, such as databases, data warehouses, and data streams, and
determining which data is required for the analysis.
Data Preprocessing
After selecting the data, the next step is data preprocessing. This step involves cleaning the data, removing outliers, and handling missing, inconsistent, or irrelevant data. This step is critical, as data quality can significantly impact the accuracy and effectiveness of the analysis.
Data Transformation
Once the data is preprocessed, the next step is to transform it into a format that data mining techniques
can analyze. This step involves reducing the data dimensionality, aggregating the data, normalizing it,
and discretizing it to prepare it for further analysis.
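As a small illustration of two of these transformations, the sketch below applies min-max normalization and equal-width discretization to a hypothetical attribute (the values and bin labels are assumptions, not from the text):

```python
# Sketch of two common transformations: min-max normalization and equal-width discretization.
values = [8, 15, 21, 24, 30, 34]  # hypothetical attribute values

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]          # rescale to [0, 1]
print("Normalized:", [round(v, 2) for v in normalized])

# Discretize into three equal-width bins: low / medium / high.
width = (hi - lo) / 3
labels = ["low" if v < lo + width else "medium" if v < lo + 2 * width else "high"
          for v in values]
print("Discretized:", labels)
```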
Data Mining
This is the heart of the KDD process and involves applying various data mining techniques to the
transformed data to discover hidden patterns, trends, relationships, and insights. A few of the most
common data mining techniques include clustering, classification, association rule mining, and anomaly
detection.
Pattern Evaluation
After the data mining, the next step is to evaluate the discovered patterns to determine their usefulness
and relevance. This involves assessing the quality of the patterns, evaluating their significance, and
selecting the most promising patterns for further analysis.
Knowledge Representation
This step involves representing the knowledge extracted from the data in a way humans can easily
understand and use. This can be done through visualizations, reports, or other forms of communication
that provide meaningful insights into the data.
Deployment
The final step in the KDD process is to deploy the knowledge and insights gained from the data mining
process to practical applications. This involves integrating the knowledge into decision-making processes
or other applications to improve organizational efficiency and effectiveness.
There are a number of different methods to perform cluster analysis. Some of them are:
Centroid-based Clustering
In this type of clustering, clusters are represented by a central entity (centroid), which may or may not be a member of the given data set. The K-Means method is used here: k is the number of cluster centers, and objects are assigned to the nearest cluster center.
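A minimal K-Means sketch, assuming scikit-learn is available and using made-up 2-D points, looks like this:

```python
# K-Means sketch: each point is assigned to the nearest of k centroids.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [2, 3],       # one natural group
                   [9, 10], [10, 9], [10, 11]])  # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)
print("Cluster centers:", kmeans.cluster_centers_)
```

The fitted cluster centers play the role of the "central entities" described above.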
Distribution-based Clustering
This type of clustering model is closely related to statistics and is based on models of distribution. Objects that likely belong to the same distribution are placed in a single cluster. This type of clustering can capture complex properties of objects, such as correlation and dependence between attributes.
Already explained.
Explain the difference between data mining and data warehousing
S. No. | Basis of Comparison | Data Warehousing | Data Mining
1 | Definition | A data warehouse is a database system that is designed for analytical analysis instead of transactional work. | Data mining is the process of analyzing data patterns.
2 | Process | Data is stored periodically. | Data is analyzed regularly.
6 | Functionality | Subject-oriented, integrated, time-varying and non-volatile data constitute a data warehouse. | AI, statistics, databases, and machine learning systems are all used in data mining technologies.
Data Quality:
Poor data quality can lead to inaccurate results and flawed insights. Incomplete, noisy, or inconsistent
data can hinder the effectiveness of data mining algorithms.
Data Preprocessing:
Before mining can begin, data often needs to be cleaned, transformed, and integrated from various
sources. This preprocessing step can be time-consuming and resource-intensive.
Scalability:
Handling large datasets can be a significant challenge. Many data mining algorithms struggle to scale
efficiently as the size of the dataset increases.
Dimensionality:
High-dimensional data can make it difficult to identify relevant patterns and relationships.
Dimensionality reduction techniques are often needed to reduce the complexity of the data.
Algorithm Selection:
Choosing the right data mining algorithm for a specific task can be challenging. Different algorithms have
strengths and weaknesses, and selecting the wrong one can lead to suboptimal results.
Interpretability:
Some data mining algorithms, such as deep learning models, can be complex and difficult to interpret.
Understanding the insights generated by these models can be a challenge.
Data Imbalance:
In many real-world datasets, the distribution of classes or outcomes may be highly imbalanced. This can
affect the performance of data mining algorithms, which may favor the majority class.
Fraud Detection:
Financial institutions and e-commerce companies employ data mining to detect fraudulent activities by
analyzing transaction patterns and anomalies.
Healthcare:
Data mining aids in disease prediction, patient diagnosis, and treatment recommendation. It also helps
healthcare providers optimize resource allocation and improve patient outcomes.
Retail:
Retailers use data mining to analyze sales data, optimize inventory, and make pricing decisions. It also
enables them to identify cross-selling and upselling opportunities.
Manufacturing:
Data mining is used for quality control, predictive maintenance, and process optimization in
manufacturing industries. It helps reduce downtime and improve production efficiency.
Telecommunications:
Telecom companies use data mining to analyze call records, network data, and customer behavior to
improve network performance and offer personalized services.
Recommendation Systems:
Online platforms like Netflix and Amazon use data mining to provide personalized recommendations
based on user preferences and behavior.
Environmental Science:
Data mining is used to analyze environmental data for climate modeling, weather forecasting, and
identifying patterns in environmental changes.
Energy Management:
Data mining assists in energy consumption analysis, load forecasting, and optimizing energy distribution
in the utility industry.
Financial Forecasting:
Data mining is employed in the financial sector for stock market prediction, credit risk assessment, and
portfolio optimization.
Data Noise:
This is when the data itself has errors or mistakes, like missing information or typos. For example, if you
have a list of numbers, data noise might occur if some numbers are recorded incorrectly.
Attribute Noise:
Attribute noise happens when specific parts of the data have errors. For instance, in a list of people's
information, attribute noise could be mistakes in ages or names.
Class Noise:
Class noise occurs when the labels or categories in your data have errors. In a task like sorting emails as
spam or not spam, class noise might mean some emails are labeled wrong.
Contextual Noise:
This type of noise happens when the meaning of data changes depending on the situation. For example,
if you're analyzing data from different sources, there might be contextual noise because the data wasn't
collected in the same way.
Temporal Noise:
Temporal noise is about changes in data over time. It can be things like seasonal patterns or trends.
When analyzing data over time, it's important to consider temporal noise.
Spatial Noise:
Spatial noise is about how data changes based on where it's collected. This can be a problem in things
like maps or GPS data if the location information isn't accurate.
Sensor Noise:
When data is collected using sensors or devices, sensor noise can happen because of measurement
errors or problems with the sensors themselves. This can be a challenge in fields like environmental
monitoring or smart devices.
Actions:
For each condition that is met, there is a corresponding action or set of actions associated with it. These
actions represent what should be done when the conditions specified in the rule are satisfied. Actions
can be diverse, ranging from simple recommendations to more complex processes or calculations.
Rule Evaluation:
When a decision needs to be made, the decision system or algorithm evaluates each decision rule one
by one. It checks whether the conditions specified in a rule are true or false based on the available data
or information.
Rule Activation:
If the conditions of a rule are true, the rule is said to be "activated." This means that the associated
actions are triggered and executed. If multiple rules are activated, their actions may be executed
sequentially or concurrently, depending on the system's design.
Decision Making:
The overall decision-making process involves considering the collective actions of all activated rules.
Depending on the context and the specific goals of the decision system, the final decision may be a
combination of these actions, a single action, or a prioritized set of actions.
Scalability:
Decision rules can be scaled to accommodate complex decision-making scenarios by creating a large set
of rules that cover various situations and outcomes. However, managing a large number of rules can be
challenging, so techniques like rule prioritization and optimization are often employed.
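The sketch below shows one way such a rule set could be evaluated in code; the loan-approval rules, thresholds, and field names are hypothetical and only illustrate the condition/action structure described above:

```python
# Minimal rule-engine sketch: each rule is a (condition, action) pair.
rules = [
    (lambda a: a["income"] > 50000 and a["credit_score"] > 700, "approve loan"),
    (lambda a: a["credit_score"] <= 600, "reject loan"),
    (lambda a: True, "refer to manual review"),  # fallback rule
]

def decide(applicant):
    # Evaluate rules in order; the first rule whose condition holds is activated.
    for condition, action in rules:
        if condition(applicant):
            return action

print(decide({"income": 60000, "credit_score": 720}))  # approve loan
print(decide({"income": 30000, "credit_score": 580}))  # reject loan
print(decide({"income": 40000, "credit_score": 650}))  # refer to manual review
```

Here the first activated rule determines the action; other designs may fire several rules and combine their actions, as described under "Decision Making" above.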
What is Cluster Analysis? Explain its working with a diagram. Discuss its advantages and disadvantages.
Old points already explained.
Advantages:
1. It can help identify patterns and relationships within a dataset that may not be immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
Disadvantages:
1. It can be sensitive to the choice of initial conditions and the number of clusters.
2. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
3. The results of the analysis can be affected by the choice of clustering algorithm used.
Preprocessing involves cleaning and preparing raw data before it is fed into a data mining algorithm. It is
crucial because real-world data is often messy, containing missing values, outliers, and inconsistencies.
By cleaning and standardizing the data, preprocessing ensures that the data mining algorithm can work
effectively and produce meaningful results. It helps in reducing noise and ensuring that the patterns and
insights extracted from the data are accurate and reliable.
Dimensionality reduction, on the other hand, is important because high-dimensional data can pose
several challenges in data mining. High-dimensional data often suffers from the curse of dimensionality,
which can lead to increased computational complexity and decreased algorithm performance.
Dimensionality reduction techniques aim to reduce the number of features or variables in the data while
preserving important information. This not only speeds up the data mining process but also helps in
preventing overfitting and improving the generalizability of the model.
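As a hedged illustration of dimensionality reduction, the sketch below projects a synthetic 10-feature dataset onto its two strongest principal components using PCA (assuming scikit-learn; the data are random and purely illustrative):

```python
# Dimensionality-reduction sketch with PCA on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 samples, 10 features

pca = PCA(n_components=2)       # keep only the 2 strongest directions of variance
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)         # (100, 10)
print("Reduced shape:", X_reduced.shape)  # (100, 2)
print("Variance explained:", pca.explained_variance_ratio_.sum())
```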
Data cleaning:
The first step in the Knowledge Discovery process is data cleaning, in which noise and inconsistent data are removed.
Data Integration:
The second step is data integration, in which multiple data sources are combined.
Data Selection:
The next step is data selection, in which data relevant to the analysis task are retrieved from the database.
Data Transformation:
In Data Transformation, data are transformed into forms appropriate for mining by performing summary
or aggregation operations.
Data Mining:
In Data Mining, data mining methods (algorithms) are applied in order to extract data patterns.
Pattern Evaluation:
In Pattern Evaluation, the discovered patterns are evaluated based on interestingness measures.
Knowledge Presentation:
In Knowledge Presentation, knowledge is presented to the user using various knowledge representation techniques.
Attribute types.
Qualitative (Nominal (N), Ordinal (O), Binary (B)).
Qualitative Attributes:
Nominal Attributes
The values of a nominal attribute are names of things or symbols. They represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank or position) among the values of a nominal attribute.
Binary Attributes:
Binary data has only two values or states, for example yes or no, affected or unaffected, true or false.
Ordinal Attributes:
Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order shows what is more important but does not indicate how much more important it is.
Quantitative Attributes:
Numeric:
A numeric attribute is quantitative because it is a measurable quantity, represented by integer or real values. Numeric attributes are of two types: interval and ratio.
Discrete:
Discrete data have a finite or countably infinite set of values; they can be numerical or categorical.
Continuous:
Continuous data have an infinite number of possible values and are typically of float type; for example, there can be many values between 2 and 3.
Bagging
1. We start with the original training dataset.
2. We create m subsets of data from the training set. We take a subset of N sample points from the initial dataset for each subset. Each subset is taken with replacement, meaning that a specific data point can be sampled more than once.
3. For each subset of data, we train the corresponding weak learners independently. These models are homogeneous, meaning that they are of the same type.
4. Each trained weak learner then makes a prediction for the new (unseen) data.
5. The predictions are aggregated into a single prediction using either max voting or averaging, as in the sketch below.
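A minimal bagging sketch, assuming scikit-learn and a synthetic dataset (nothing here comes from the text above), could look like this:

```python
# Bagging sketch: decision trees trained on bootstrap samples, predictions aggregated by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # synthetic data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default the base (weak) learner is a decision tree; 20 of them are trained
# on bootstrap samples drawn with replacement.
bagging = BaggingClassifier(n_estimators=20, random_state=0)
bagging.fit(X_train, y_train)
print("Bagging test accuracy:", bagging.score(X_test, y_test))
```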
Boosting
1. We start with the initial training dataset and take a subset of it.
2. We train the first weak learner on this subset.
3. We test the trained weak learner using the training data. As a result of the testing, some data points are predicted incorrectly.
4. Each data point with the wrong prediction is sent into the second subset of data, and this subset is updated.
5. Using this updated subset, we train and test the second weak learner.
6. We continue with the following subsets until the total number of subsets is reached.
7. We now have the total prediction: the overall prediction has been aggregated at each step, so the final output is a weighted combination of all the weak learners' predictions (a library-based sketch follows below).
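One common concrete boosting algorithm is AdaBoost, which re-weights the points earlier learners got wrong rather than literally copying them into a new subset; a minimal sketch (assuming scikit-learn, synthetic data) is:

```python
# Boosting sketch with AdaBoost: weak learners are trained sequentially,
# with misclassified points given more weight for the next learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # synthetic data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # default weak learner: shallow tree
boosting.fit(X_train, y_train)
print("AdaBoost test accuracy:", boosting.score(X_test, y_test))
```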
Suppose that the data for analysis includes the attribute price (in dollars): 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34. Use smoothing by bin means to smooth these data, using a bin depth of 4. Illustrate your steps.
Here are the steps to illustrate how to do this:
Step 1: Sort the Data. First, sort the data in ascending order: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34.
Step 2: Create Bins. Next, create bins with a depth of 4, i.e. group the sorted data into bins of four values each:
• Bin 1: 8, 9, 15, 16
• Bin 2: 21, 21, 24, 26
• Bin 3: 27, 30, 30, 34
Step 3: Calculate Bin Means. Now calculate the mean (average) value for each bin: Bin 1 mean = (8 + 9 + 15 + 16) / 4 = 12; Bin 2 mean = (21 + 21 + 24 + 26) / 4 = 23; Bin 3 mean = (27 + 30 + 30 + 34) / 4 = 30.25.
Step 4: Replace Data Points. Replace each data point in the original data set with the mean value of its bin.
Smoothed Data: 12, 12, 12, 12, 23, 23, 23, 23, 30.25, 30.25, 30.25, 30.25
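A short sketch to reproduce these steps in code (same numbers as above):

```python
# Quick check of the smoothing-by-bin-means steps above.
data = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
depth = 4

data.sort()
bins = [data[i:i + depth] for i in range(0, len(data), depth)]
smoothed = []
for b in bins:
    mean = sum(b) / len(b)
    smoothed.extend([mean] * len(b))

print(bins)      # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(smoothed)  # [12.0, ..., 23.0, ..., 30.25, ...]
```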
Euclidean Distance:
Euclidean distance is the most widely used distance metric. It calculates the straight-line distance
between two points in the multi-dimensional space. For a data point and a cluster, you can compute the
Euclidean distance between the data point and the centroid (mean) of the cluster.
Manhattan Distance:
Manhattan distance, also known as L1 distance or city block distance, measures the distance between
two points by summing the absolute differences of their coordinates. It can be used to calculate the
distance between a point and a cluster's centroid.
Cosine Similarity:
Cosine similarity measures the cosine of the angle between two vectors. In the context of clustering, you
can calculate the cosine similarity between a data point and the centroid of a cluster. This is particularly
useful for text data or when you want to measure similarity based on the direction of vectors.
Mahalanobis Distance:
Mahalanobis distance considers the correlations between variables in the dataset. It's a more advanced
metric that takes into account the covariance structure of the data. It is useful when dealing with
multivariate data.
Jaccard Similarity:
Jaccard similarity measures the similarity between two sets by comparing the size of their intersection to
the size of their union. It's often used for clustering binary or categorical data.
Correlation-Based Distance:
For datasets with continuous variables, you can use correlation-based distances like Pearson or
Spearman correlation coefficients to measure the similarity between a data point and a cluster's central
tendency.
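The sketch below computes three of these measures for a single data point against a cluster centroid; the two vectors are made-up examples:

```python
# Euclidean distance, Manhattan distance, and cosine similarity for a point vs. a centroid.
import math

point = [2.0, 3.0, 5.0]
centroid = [1.0, 1.0, 4.0]

euclidean = math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
manhattan = sum(abs(p - c) for p, c in zip(point, centroid))

dot = sum(p * c for p, c in zip(point, centroid))
norms = math.sqrt(sum(p * p for p in point)) * math.sqrt(sum(c * c for c in centroid))
cosine_similarity = dot / norms

print("Euclidean:", round(euclidean, 3))   # straight-line distance
print("Manhattan:", round(manhattan, 3))   # sum of absolute coordinate differences
print("Cosine similarity:", round(cosine_similarity, 3))
```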
Apply the KNN classifier to predict the diabetic patient with the given features BMI and Age. Assume k = 3. Test example: BMI = 43.6, Age = 40, Sugar = ?
Formula: the Euclidean distance from the test example to a training example is √((43.6 - BMI)^2 + (40 - Age)^2).
For the first training example shown (BMI = 31, Age = 56, Sugar = 1), the distance is √((43.6 - 31)^2 + (40 - 56)^2) ≈ 20.37.
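A sketch of the full KNN step with k = 3: only the first training row (BMI = 31, Age = 56, Sugar = 1) appears in the excerpt above, so the remaining rows below are hypothetical placeholders used purely to show the distance-then-vote mechanics.

```python
# KNN sketch for the test point (BMI=43.6, Age=40) with k = 3.
import math
from collections import Counter

train = [
    (31.0, 56, 1),   # row shown above: distance ≈ 20.37
    (43.0, 42, 0),   # hypothetical placeholder
    (40.0, 38, 1),   # hypothetical placeholder
    (25.0, 30, 0),   # hypothetical placeholder
]
test_bmi, test_age, k = 43.6, 40, 3

# Euclidean distance of the test point to every training point.
distances = [(math.hypot(test_bmi - bmi, test_age - age), sugar)
             for bmi, age, sugar in train]
nearest = sorted(distances)[:k]

# Majority vote among the k nearest neighbours decides the Sugar class.
prediction = Counter(sugar for _, sugar in nearest).most_common(1)[0][0]
print("3 nearest distances:", [round(d, 2) for d, _ in nearest])
print("Predicted Sugar class:", prediction)
```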
Web structure mining focuses on analyzing the link structure of the World Wide Web, including
hyperlinks between web pages. This application of data mining helps uncover valuable insights about
website relationships, authority, and connectivity. Search engine optimization (SEO) is a prominent
application of web structure mining. By analyzing the link structure of the web, search engines can
determine the authority and relevance of web pages. Pages with more inbound links from reputable
sources are considered authoritative and are more likely to rank higher in search engine results.
The hold-out method is good to use when you have a very large dataset, you’re on a time crunch, or you
are starting to build an initial model in your data science project. Keep in mind that because cross-
validation uses multiple train-test splits, it takes more computational power and time to run than using
the holdout method.
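A side-by-side sketch of the two approaches, assuming scikit-learn and a synthetic dataset:

```python
# Hold-out split vs. k-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold-out: one train/test split, fast but the score depends on that single split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: 5 train/test splits, slower but a more stable estimate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```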
i) edit_distance(Faloutsos, Kollios)
Faloutsos, Gough:
Character in Faloutsos | Character in Gough | Edit distance
F | G | 1
a | o | 1
l | u | 1
o | g | 1
u | h | 1
t | – | 1
s | – | 1
The total edit distance is the sum of these per-character edit costs, which is 7.
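For reference, a standard dynamic-programming (Levenshtein) implementation also gives 7 for this pair of strings:

```python
# Levenshtein (edit) distance via dynamic programming.
def edit_distance(a: str, b: str) -> int:
    # dp[i][j] = cost of turning the first i chars of a into the first j chars of b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(b) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute (or match)
    return dp[len(a)][len(b)]

print(edit_distance("Faloutsos", "Gough"))  # 7
```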
Median
Median is the middle value in a dataset when the data is arranged in numerical order. If there are two
middle values, the median is the average of those two values.
Mode
Mode is the most frequent value in a dataset.
Standard deviation
Standard deviation is a measure of how spread out the values in a dataset are. A higher standard
deviation means that the values are more spread out, while a lower standard deviation means that the
values are more concentrated around the mean.
Variance
Variance is the square of the standard deviation. It is another measure of how spread out the values in a
dataset are.
[65, 67, 68, 69, 70, 71, 72, 73, 74, 75]
The mean height is 70.4 inches. The median height is 70.5 inches (the average of the two middle values, 70 and 71). Every value occurs exactly once, so the data set has no mode. Using the population formulas, the variance is 9.24 square inches and the standard deviation is approximately 3.04 inches. The five-number summary (with quartiles taken as the medians of the lower and upper halves) is as follows:
Minimum: 65 inches
Q1: 68 inches
Median: 70.5 inches
Q3: 73 inches
Maximum: 75 inches
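These figures can be reproduced with Python's standard statistics module (quartiles here use the median-of-halves convention):

```python
# Recomputing the summary statistics for the height data above.
import statistics

heights = [65, 67, 68, 69, 70, 71, 72, 73, 74, 75]

print("Mean:", statistics.mean(heights))                            # 70.4
print("Median:", statistics.median(heights))                        # 70.5
print("Population variance:", statistics.pvariance(heights))        # 9.24
print("Population std dev:", round(statistics.pstdev(heights), 2))  # ~3.04

# Five-number summary using the median-of-halves convention for quartiles.
lower, upper = heights[:5], heights[5:]
print("Min:", min(heights), "Q1:", statistics.median(lower),
      "Median:", statistics.median(heights),
      "Q3:", statistics.median(upper), "Max:", max(heights))
```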
To classify a new data point, the decision tree starts at the root node and asks the question associated
with that node. It then follows the branch that corresponds to the answer to the question. This process
continues until the decision tree reaches a leaf node, which contains the predicted class label.
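A minimal sketch of this question-and-branch process, assuming scikit-learn and a tiny made-up dataset of [age, income] features:

```python
# Decision-tree classification sketch on hypothetical data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [age, income]; labels: 0 = "no purchase", 1 = "purchase".
X = [[25, 30000], [35, 60000], [45, 80000], [22, 20000], [50, 90000], [30, 40000]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the learned questions/branches
print("Prediction for [40, 70000]:", tree.predict([[40, 70000]])[0])
```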
Fraud detection:
Clustering can be used to identify fraudulent transactions by grouping together transactions that exhibit
similar patterns.
Image segmentation:
Clustering can be used to segment images into different objects, such as people, cars, and trees. This can
be useful for tasks such as object detection and recognition.
Medical diagnosis:
Clustering can be used to group patients with similar symptoms and medical history. This information
can then be used to diagnose diseases and recommend treatments.
Hierarchical clustering:
Hierarchical clustering is a type of clustering algorithm that produces a hierarchy of clusters. The
hierarchy can be used to identify different levels of similarity in the data.
Density-based clustering:
Density-based clustering is a type of clustering algorithm that groups together data points that lie close to one another in dense regions, separating them from regions of low density.
Model-based clustering:
Model-based clustering is a type of clustering algorithm that assumes that the data follows a particular
model, such as a Gaussian distribution. The algorithm then groups together data points that are similar
according to the model.
(10, 12), (11, 13), (12, 14), (13, 15), (14, 16)
(20, 22), (21, 23), (22, 24), (23, 25), (24, 26)
Data warehouses solve this problem by providing a central repository for storing and managing all of a
business's data. This makes it easy to run complex queries and generate reports that can be used to
improve decision-making.
Data mining:
Data mining is the process of identifying patterns and trends in large datasets. Data mining can be used
to identify new market opportunities, optimize pricing strategies, and predict customer behavior.
Machine learning:
Machine learning is a type of artificial intelligence that allows computers to learn without being explicitly
programmed. Machine learning can be used to build predictive models that can be used to improve
business decision-making.
Statistical analysis:
Statistical analysis is the process of collecting, analyzing, and interpreting data. Statistical analysis can be
used to test hypotheses, identify relationships between variables, and make predictions.