Past PPR

Explain the major elements of data mining.

Data mining consists of five major elements:


• Extract, transform and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.

What are different data mining techniques?



1. Classification:
This technique is used to obtain important and relevant information about data and metadata. It assigns
data items to one of several predefined classes.

2. Clustering:
Clustering is the division of data into groups of related objects. Describing the data by a few clusters
loses some fine detail but achieves simplification; the data are modelled by their clusters. Historically,
clustering is rooted in statistics, mathematics, and numerical analysis.

3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between
variables, i.e., how one variable changes in the presence of other factors. It is used to estimate the value
or probability of a specific variable, and is primarily a form of planning and modeling.

4. Association Rules:
This data mining technique helps to discover links between two or more items and finds hidden patterns
in the data set. Association rules are if-then statements that show the probability of interactions
between data items within large data sets in different types of databases. Association rule mining has
several applications and is commonly used to find sales correlations in transactional data or in medical
data sets.
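As a rough illustration, the sketch below computes the two standard association-rule measures, support and confidence, for a single made-up rule over a toy set of transactions (the items and the rule are invented for the example):

```python
# A minimal sketch of association-rule measures on a toy transaction set.
# The transactions and the rule {bread} -> {butter} are made-up examples.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"butter"}

# Support: fraction of transactions containing both sides of the rule.
support = sum(1 for t in transactions
              if antecedent | consequent <= t) / len(transactions)
# Confidence: support of the rule divided by support of the antecedent alone.
antecedent_support = sum(1 for t in transactions
                         if antecedent <= t) / len(transactions)
confidence = support / antecedent_support

print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.50, confidence=0.67
```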

5. Outlier Detection:
This data mining technique observes data items in the data set that do not match an expected pattern or
expected behavior. It may be used in various domains such as intrusion detection and fraud detection,
and is also known as outlier analysis or outlier mining.

6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It involves finding interesting subsequences in a set of sequences, where
the interestingness of a subsequence can be measured by criteria such as length and occurrence frequency.

7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, and
classification. It analyzes past events or instances in the right sequence to predict a future event.

Why should we use data warehousing and how can you extract data for
analysis?
Why Use Data Warehousing:
Data warehousing offers several advantages for organizations seeking to leverage their data effectively. It
provides a centralized repository for data, integrating information from various sources while ensuring
consistency and accuracy. Data warehousing stores historical data, enabling long-term trend analysis. It
also improves data quality through cleansing and transformation processes. With optimized query



performance and support for complex analytics, data warehouses facilitate data-driven decision-making.
Robust security features control data access and modifications, enhancing data protection.

Extracting Data for Analysis:


To extract data from a data warehouse for analysis, several methods are available. SQL queries provide a
standard way to retrieve specific data using structured query language. Business intelligence (BI) tools
like Tableau, Power BI, or Looker offer user-friendly interfaces for creating reports and visualizations. ETL
(Extract, Transform, Load) processes transform data and load it into analytics platforms or data lakes.
Some data warehouses provide APIs and integrations for programmatic data access and extraction.
Scheduled jobs automate data extraction, ensuring data currency. Data export options in formats like
CSV or JSON facilitate analysis in other tools or platforms, completing the array of extraction methods.
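As a minimal sketch of the SQL-based extraction route, the example below runs an aggregate query against an in-memory SQLite database standing in for a warehouse; the "sales" table, its columns, and the rows are assumptions made up for illustration:

```python
# Hypothetical sketch: pulling aggregated data out of a warehouse table with SQL.
# sqlite3 stands in for the warehouse here; the "sales" table is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", 2022, 120.0), ("North", 2023, 150.0),
                  ("South", 2023, 90.0)])

# A typical analytical query: total sales per region per year.
rows = conn.execute("""
    SELECT region, year, SUM(amount)
    FROM sales
    GROUP BY region, year
    ORDER BY region, year
""").fetchall()

for region, year, total in rows:
    print(region, year, total)
```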

Explain the use of data mining queries or why data mining queries are
more helpful?

Data mining queries are used to extract patterns and insights from large datasets. They are more helpful
than traditional database queries because they can be used to discover hidden relationships and
patterns in the data that would be difficult or impossible to find using traditional methods.

Uses of data mining queries:


• Predicting future outcomes. For example, a data mining query could be used to predict which
customers are likely to churn or which products are likely to be popular in the future.
• Identifying patterns and trends. For example, a data mining query could be used to identify the
most common customer segments or the most popular shopping cart combinations.
• Detecting fraud and anomalies. For example, a data mining query could be used to identify
fraudulent credit card transactions or unusual network activity.
• Personalizing recommendations. For example, a data mining query could be used to recommend
products to customers based on their past purchase history.

Data mining queries are more helpful than traditional database queries because they can be used to:

• Find hidden relationships and patterns in the data. Traditional database queries can only be used
to find data that has already been explicitly defined. Data mining queries, on the other hand, can
be used to discover new relationships and patterns in the data that were not previously known.
• Analyze large datasets more efficiently. Traditional database queries can be slow and inefficient
when used to analyze large datasets. Data mining queries, on the other hand, are designed to be
efficient and scalable.
• Provide more insights into the data. Traditional database queries can only be used to retrieve
raw data. Data mining queries, on the other hand, can be used to extract insights from the data,
such as patterns, trends, and predictions.



Describe and briefly discuss the KDD process and draw the figures also.
KDD (Knowledge Discovery in Databases) is a process of discovering useful knowledge and insights from
large and complex datasets. The KDD process involves a range of techniques and methodologies,
including data preprocessing, data transformation, data mining, pattern evaluation, and knowledge
representation. KDD and data mining are closely related processes, with data mining being a key
component and subset of the KDD process.

KDD Process in Data Mining


The KDD process in data mining is a multi-step process that involves various stages to extract useful
knowledge from large datasets.

Data Selection
The first step in the KDD process is identifying and selecting the relevant data for analysis. This involves
choosing the relevant data sources, such as databases, data warehouses, and data streams, and
determining which data is required for the analysis.

Data Preprocessing
After selecting the data, the next step is data preprocessing. This step involves cleaning the data,
removing outliers, and removing missing, inconsistent, or irrelevant data. This step is critical, as the data
quality can significantly impact the accuracy and effectiveness of the analysis.

Data Transformation
Once the data is preprocessed, the next step is to transform it into a format that data mining techniques
can analyze. This step involves reducing the data dimensionality, aggregating the data, normalizing it,
and discretizing it to prepare it for further analysis.

Data Mining
This is the heart of the KDD process and involves applying various data mining techniques to the
transformed data to discover hidden patterns, trends, relationships, and insights. A few of the most
common data mining techniques include clustering, classification, association rule mining, and anomaly
detection.

Pattern Evaluation
After the data mining, the next step is to evaluate the discovered patterns to determine their usefulness
and relevance. This involves assessing the quality of the patterns, evaluating their significance, and
selecting the most promising patterns for further analysis.

Knowledge Representation
This step involves representing the knowledge extracted from the data in a way humans can easily
understand and use. This can be done through visualizations, reports, or other forms of communication
that provide meaningful insights into the data.

Deployment
The final step in the KDD process is to deploy the knowledge and insights gained from the data mining
process to practical applications. This involves integrating the knowledge into decision-making processes
or other applications to improve organizational efficiency and effectiveness.



Discuss the Neural Networks Algorithm in data mining.
A neural network is an information processing paradigm inspired by the human nervous system. Just as
the human nervous system has biological neurons, neural networks have artificial neurons, which are
mathematical functions modeled on biological neurons. The human brain is estimated to have around
10 billion neurons, each connected on average to 10,000 other neurons. Each neuron receives signals
through synapses that control the effect of the signal on the neuron.



What is cluster analysis in Data Mining?
Cluster analysis is a multivariate data mining technique whose goal is to group objects (e.g., products,
respondents, or other entities) based on a set of user-selected characteristics or attributes. It is a basic
and important step of data mining and a common technique for statistical data analysis, used in many
fields such as data compression, machine learning, pattern recognition, and information retrieval.

There are a number of different methods to perform cluster analysis. Some of them are,

Hierarchical Cluster Analysis


In this method, a cluster is first formed and then merged with another cluster (the most similar and
closest one) to form a single larger cluster. This process is repeated until all objects are in one cluster;
this particular approach is known as the agglomerative method, which starts with single objects and
groups them into clusters. The divisive method is the other kind of hierarchical clustering, in which
clustering starts with the complete data set and then divides it into partitions.

Centroid-based Clustering
In this type of clustering, clusters are represented by a central entity, which may or may not be part of
the given data set. The k-means method is the typical example, where k cluster centers are chosen and
objects are assigned to the nearest cluster center.

Distribution-based Clustering
This is a type of clustering model, closely related to statistics, based on models of distribution. Objects
that belong to the same distribution are put into a single cluster. This type of clustering can capture some
complex properties of objects, such as correlation and dependence between attributes.



Density-based Clustering
In this type of clustering, clusters are defined as areas of higher density than the rest of the data set.
Objects in sparse areas, which are usually required to separate clusters, are typically treated as noise or
border points. The most popular method of this type is DBSCAN.
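A minimal sketch of density-based clustering with DBSCAN is shown below, assuming scikit-learn is available; the toy points and the eps/min_samples settings are chosen only for illustration:

```python
# A minimal sketch of density-based clustering with DBSCAN (scikit-learn).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # first dense group
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # second dense group
              [4.5, 0.0]])                           # isolated sparse point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # the isolated point gets label -1, i.e. it is treated as noise
```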



Explain the techniques of data mining.

Already explained above.
Explain the difference between data mining and data warehousing
Basis of comparison between Data Warehousing and Data Mining:

1. Definition: A data warehouse is a database system designed for analytical work instead of
transactional work. Data mining is the process of analyzing data patterns.

2. Process: In data warehousing, data is stored periodically. In data mining, data is analyzed regularly.

3. Purpose: Data warehousing is the process of extracting and storing data to allow easier reporting.
Data mining is the use of pattern recognition logic to identify patterns.

4. Managing Authorities: Data warehousing is solely carried out by engineers. Data mining is carried out
by business users with the help of engineers.

5. Data Handling: Data warehousing is the process of pooling all relevant data together. Data mining is
considered a process of extracting information from large data sets.

6. Functionality: Subject-oriented, integrated, time-varying and non-volatile data constitute a data
warehouse. AI, statistics, databases, and machine learning systems are all used in data mining
technologies.

7. Task: Data warehousing is the process of extracting and storing data in order to make reporting more
efficient. Pattern recognition logic is used in data mining to find patterns.

8. Uses: Data warehousing extracts data and stores it in an orderly format, making reporting easier and
faster. Data mining employs pattern recognition tools to aid in the identification of access patterns.

9. Examples: A data warehouse adds value when it is connected with operational business systems like
CRM (Customer Relationship Management) systems. Data mining aids in the creation of suggestive
patterns of key parameters (for example customer purchasing behavior, items, and sales), so businesses
can make the required adjustments to their operations and production.

How to handle missing values?


Missing values are a common occurrence, and you need to have a strategy for treating them. A missing
value can signify a number of different things in your data. Perhaps the data was not available or not
applicable, or the event did not happen. It could be that the person who entered the data did not know
the right value, or missed filling it in. Data mining methods vary in the way they treat missing values.
Typically, they ignore the missing values, exclude any records containing missing values, replace
missing values with the mean, or infer missing values from existing values.



Missing value replacement policies:

• Ignore the records with missing values.
• Replace them with a global constant (e.g., “?”).
• Fill in missing values manually based on your domain knowledge.
• Replace them with the variable mean (if numerical) or the most frequent value (if categorical).
• Use modeling techniques such as nearest neighbors, Bayes’ rule, decision trees, or the EM algorithm.
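The sketch below illustrates a few of these replacement policies with pandas, assuming pandas and NumPy are installed; the small DataFrame is invented for the example:

```python
# A short sketch of common missing-value policies using pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":  [25, np.nan, 40, 35],
                   "city": ["Lahore", "Karachi", None, "Karachi"]})

dropped  = df.dropna()                           # ignore records with missing values
constant = df.fillna({"age": -1, "city": "?"})   # replace with global constants
# Replace with the mean (numeric column) or the most frequent value (categorical column).
imputed  = df.assign(age=df["age"].fillna(df["age"].mean()),
                     city=df["city"].fillna(df["city"].mode()[0]))
print(imputed)
```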

What are the primary data mining tasks?

The primary data mining tasks correspond to the techniques listed earlier: classification, clustering,
regression, association rule mining, outlier detection, sequential pattern mining, and prediction.

Write the challenges of data mining?


Data mining, the process of discovering hidden patterns, trends, and insights within large datasets,
comes with several challenges. These challenges can make the data mining process complex and
demanding. Here are some of the key challenges of data mining:

Data Quality:
Poor data quality can lead to inaccurate results and flawed insights. Incomplete, noisy, or inconsistent
data can hinder the effectiveness of data mining algorithms.

Data Preprocessing:
Before mining can begin, data often needs to be cleaned, transformed, and integrated from various
sources. This preprocessing step can be time-consuming and resource-intensive.

Scalability:
Handling large datasets can be a significant challenge. Many data mining algorithms struggle to scale
efficiently as the size of the dataset increases.

Dimensionality:
High-dimensional data can make it difficult to identify relevant patterns and relationships.
Dimensionality reduction techniques are often needed to reduce the complexity of the data.

Complex Data Types:


Data can come in various formats, including text, images, and videos. Mining these complex data types
requires specialized techniques and tools.

Privacy and Security:


Ensuring the privacy and security of sensitive data is crucial. Data mining may reveal sensitive
information, and there is a need to protect individuals' privacy.

Algorithm Selection:
Choosing the right data mining algorithm for a specific task can be challenging. Different algorithms have
strengths and weaknesses, and selecting the wrong one can lead to suboptimal results.

Interpretability:
Some data mining algorithms, such as deep learning models, can be complex and difficult to interpret.
Understanding the insights generated by these models can be a challenge.

Bias and Fairness:


Data mining can inherit biases present in the data. Ensuring fairness and mitigating bias in the results is
an ongoing challenge, especially in applications like predictive policing or credit scoring.

Data Imbalance:
In many real-world datasets, the distribution of classes or outcomes may be highly imbalanced. This can
affect the performance of data mining algorithms, which may favor the majority class.

What are the applications of data mining?


Data mining is a versatile field that utilizes various techniques to discover patterns, relationships, and
insights within large datasets. Its applications span across different industries and domains. Here are
some common applications of data mining:



Business Intelligence:
Data mining is extensively used in business for market segmentation, customer profiling, and trend
analysis. It helps businesses make data-driven decisions and optimize strategies.

Customer Relationship Management (CRM):


Data mining assists in identifying valuable customers, predicting their needs, and improving customer
retention through personalized marketing and recommendations.

Fraud Detection:
Financial institutions and e-commerce companies employ data mining to detect fraudulent activities by
analyzing transaction patterns and anomalies.

Healthcare:
Data mining aids in disease prediction, patient diagnosis, and treatment recommendation. It also helps
healthcare providers optimize resource allocation and improve patient outcomes.

Retail:
Retailers use data mining to analyze sales data, optimize inventory, and make pricing decisions. It also
enables them to identify cross-selling and upselling opportunities.

Manufacturing:
Data mining is used for quality control, predictive maintenance, and process optimization in
manufacturing industries. It helps reduce downtime and improve production efficiency.

Telecommunications:
Telecom companies use data mining to analyze call records, network data, and customer behavior to
improve network performance and offer personalized services.

Recommendation Systems:
Online platforms like Netflix and Amazon use data mining to provide personalized recommendations
based on user preferences and behavior.

Environmental Science:
Data mining is used to analyze environmental data for climate modeling, weather forecasting, and
identifying patterns in environmental changes.

Genomics and Bioinformatics:


In genetics, data mining is applied to analyze DNA sequences, identify genetic markers, and discover
relationships between genes and diseases.

Social Network Analysis:


Data mining is used to study social networks, detect influencers, and understand the spread of
information and trends on platforms like Facebook and Twitter.

Energy Management:
Data mining assists in energy consumption analysis, load forecasting, and optimizing energy distribution
in the utility industry.



Human Resources:
HR departments use data mining for talent acquisition, employee performance analysis, and workforce
planning.

Crime Detection and Prevention:


Law enforcement agencies use data mining to analyze crime data and identify patterns to aid in crime
prevention and investigations.

Financial Forecasting:
Data mining is employed in the financial sector for stock market prediction, credit risk assessment, and
portfolio optimization.



What is Noise? Explain its types.
In data mining, "noise" refers to unwanted or random variations in data that can make it harder to find
meaningful patterns or information. There are different types of noise in data:

Data Noise:
This is when the data itself has errors or mistakes, like missing information or typos. For example, if you
have a list of numbers, data noise might occur if some numbers are recorded incorrectly.

Attribute Noise:
Attribute noise happens when specific parts of the data have errors. For instance, in a list of people's
information, attribute noise could be mistakes in ages or names.

Class Noise:
Class noise occurs when the labels or categories in your data have errors. In a task like sorting emails as
spam or not spam, class noise might mean some emails are labeled wrong.

Contextual Noise:
This type of noise happens when the meaning of data changes depending on the situation. For example,
if you're analyzing data from different sources, there might be contextual noise because the data wasn't
collected in the same way.

Temporal Noise:
Temporal noise is about changes in data over time. It can be things like seasonal patterns or trends.
When analyzing data over time, it's important to consider temporal noise.

Spatial Noise:
Spatial noise is about how data changes based on where it's collected. This can be a problem in things
like maps or GPS data if the location information isn't accurate.

Sensor Noise:
When data is collected using sensors or devices, sensor noise can happen because of measurement
errors or problems with the sensors themselves. This can be a challenge in fields like environmental
monitoring or smart devices.

Write a short note on ANN


Artificial Neural Networks (ANNs) imitate the behavior of the human brain. ANNs allow computer
programs to recognize patterns and solve common problems. The brain has neurons that process
information in the form of electric signals; similarly, an ANN receives information through a number of
processors that operate in parallel and are arranged in tiers. Artificial Neural Networks consist of node
layers: an input layer, one or more hidden layers, and an output layer. Each node is connected to others,
and each connection has an associated weight and threshold. If the output of an individual node is above
the specified threshold value, that node is activated, sending data to the next layer of the network;
otherwise, no data is passed on to the next layer.
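A minimal sketch of a single forward pass through such a network (one hidden layer, sigmoid activation) is shown below using NumPy; the weights, biases, and inputs are made-up numbers, not a trained model:

```python
# A minimal sketch of one forward pass through a tiny network (one hidden layer).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])              # input layer (2 features)

W1 = np.array([[0.1, 0.4],            # weights: input -> hidden (3 neurons)
               [0.7, 0.2],
               [0.3, 0.9]])
b1 = np.array([0.1, 0.1, 0.1])

W2 = np.array([[0.5, 0.6, 0.1]])      # weights: hidden -> output (1 neuron)
b2 = np.array([0.2])

hidden = sigmoid(W1 @ x + b1)         # each node sums its weighted inputs, then activates
output = sigmoid(W2 @ hidden + b2)
print(output)                         # a value in (0, 1)
```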



Write a short note on the following: Neural Network

Already explained above.
What is Decision Rules? How does it work?
Decision rules are a fundamental concept in the field of decision-making and decision support systems.
They are used to formalize and automate decision-making processes in various domains, such as
business, healthcare, and engineering. Decision rules are essentially a set of conditions and
corresponding actions that guide a decision-making process. These rules are typically expressed in the
form of "if-then" statements, where specific conditions are checked and, if those conditions are met,
certain actions are taken.

Here's how decision rules work:


Conditions:
Decision rules start with a set of conditions or criteria that need to be evaluated. These conditions can
be based on various factors, such as data, expert knowledge, or a combination of both. Conditions are
often expressed as Boolean statements (true or false), and they represent the characteristics or
attributes of the situation being analyzed.

Actions:
For each condition that is met, there is a corresponding action or set of actions associated with it. These
actions represent what should be done when the conditions specified in the rule are satisfied. Actions
can be diverse, ranging from simple recommendations to more complex processes or calculations.

Rule Evaluation:
When a decision needs to be made, the decision system or algorithm evaluates each decision rule one
by one. It checks whether the conditions specified in a rule are true or false based on the available data
or information.

Rule Activation:
If the conditions of a rule are true, the rule is said to be "activated." This means that the associated
actions are triggered and executed. If multiple rules are activated, their actions may be executed
sequentially or concurrently, depending on the system's design.

Decision Making:
The overall decision-making process involves considering the collective actions of all activated rules.
Depending on the context and the specific goals of the decision system, the final decision may be a
combination of these actions, a single action, or a prioritized set of actions.

Feedback and Learning:


Decision rules can also be refined and improved over time through feedback and learning mechanisms. If
certain rules consistently lead to better outcomes, they can be given higher priority or modified to
enhance their performance.

Scalability:
Decision rules can be scaled to accommodate complex decision-making scenarios by creating a large set
of rules that cover various situations and outcomes. However, managing a large number of rules can be
challenging, so techniques like rule prioritization and optimization are often employed.
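The sketch below shows one simple way such if-then rules can be represented and evaluated in code; the loan-application fields, thresholds, and actions are hypothetical examples:

```python
# A small sketch of "if-then" decision rules evaluated against a record;
# the rules and the loan-application fields are hypothetical examples.
rules = [
    {"name": "high_risk",
     "condition": lambda r: r["credit_score"] < 600,
     "action": "reject application"},
    {"name": "manual_review",
     "condition": lambda r: 600 <= r["credit_score"] < 700 or r["income"] < 30000,
     "action": "send to manual review"},
    {"name": "auto_approve",
     "condition": lambda r: r["credit_score"] >= 700 and r["income"] >= 30000,
     "action": "approve automatically"},
]

record = {"credit_score": 720, "income": 45000}

# Rule evaluation: check each condition; activated rules contribute their actions.
actions = [rule["action"] for rule in rules if rule["condition"](record)]
print(actions)   # ['approve automatically']
```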

What is Cluster Analysis? Explain its working with a diagram. Discuss its
advantages and disadvantages.
Already explained above.



Advantages of Cluster Analysis:

1. It can help identify patterns and relationships within a dataset that may not be immediately
obvious.

2. It can be used for exploratory data analysis and can help with feature selection.

3. It can be used to reduce the dimensionality of the data.

4. It can be used for anomaly detection and outlier identification.

5. It can be used for market segmentation and customer profiling.

Disadvantages of Cluster Analysis:

1. It can be sensitive to the choice of initial conditions and the number of clusters.

2. It can be sensitive to the presence of noise or outliers in the data

3. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.

4. It can be computationally expensive for large datasets.

5. The results of the analysis can be affected by the choice of clustering algorithm used.



Why are preprocessing and dimensionality reduction important phases
in successful data mining applications?
Preprocessing and dimensionality reduction are essential phases in successful data mining applications
because they enhance the quality and efficiency of the entire data analysis process.

Preprocessing involves cleaning and preparing raw data before it is fed into a data mining algorithm. It is
crucial because real-world data is often messy, containing missing values, outliers, and inconsistencies.
By cleaning and standardizing the data, preprocessing ensures that the data mining algorithm can work
effectively and produce meaningful results. It helps in reducing noise and ensuring that the patterns and
insights extracted from the data are accurate and reliable.

Dimensionality reduction, on the other hand, is important because high-dimensional data can pose
several challenges in data mining. High-dimensional data often suffers from the curse of dimensionality,
which can lead to increased computational complexity and decreased algorithm performance.
Dimensionality reduction techniques aim to reduce the number of features or variables in the data while
preserving important information. This not only speeds up the data mining process but also helps in
preventing overfitting and improving the generalizability of the model.
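As a brief illustration of dimensionality reduction, the sketch below projects a small synthetic dataset onto its two principal components with PCA, assuming scikit-learn is available:

```python
# A brief sketch of dimensionality reduction with PCA (scikit-learn);
# the 4-dimensional random data is invented purely for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 records, 4 features

pca = PCA(n_components=2)              # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```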

What are the different steps involved in KDD process.


Knowledge Discovery Process steps:

Data cleaning:
The first step in the Knowledge Discovery Process is data cleaning, in which noise and inconsistent data
are removed.

Data Integration:
Second step is Data Integration in which multiple data sources are combined.

Data Selection:
Next step is Data Selection in which data relevant to the analysis task are retrieved from the database.

Data Transformation:
In Data Transformation, data are transformed into forms appropriate for mining by performing summary
or aggregation operations.

Data Mining:
In Data Mining, data mining methods (algorithms) are applied in order to extract data patterns.

Pattern Evaluation:
In Pattern Evaluation, data patterns are identified based on some interesting measures.

Knowledge Presentation:
In Knowledge Presentation, the extracted knowledge is presented to the user using various knowledge
representation techniques.



What is an Artificial Neural Network?
Already explained
What is an attribute? Explain different types of attributes.
Attribute:
An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a
customer object, attributes can be customer ID, address, etc. A set of attributes used to describe a given
object is known as an attribute vector or feature vector.

Attribute types.
Qualitative (Nominal (N), Ordinal (O), Binary(B)).

Quantitative (Numeric, Discrete, Continuous)

Qualitative Attributes:

Nominal Attributes:
The values of a nominal attribute are names of things or symbols. Values of nominal attributes represent
some category or state, which is why nominal attributes are also referred to as categorical attributes;
there is no order (rank, position) among the values of a nominal attribute.

Binary Attributes:
Binary data has only 2 values/states. For example: yes or no, affected or unaffected, true or false.

• Symmetric: both values are equally important (e.g., gender).

• Asymmetric: the two values are not equally important (e.g., a test result).

Ordinal Attributes:
Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but
the magnitude of the difference between values is not known; the order shows what is important but not
how much more important one value is than another.

Quantitative Attributes:
Numeric:
A numeric attribute is quantitative because it is a measurable quantity, represented by integer or real
values. Numeric attributes are of two types: interval and ratio.

Discrete:
Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or
countably infinite set of values.

Continuous:
Continuous data have an infinite number of possible states and are typically of floating-point type. There
can be many values between 2 and 3.



What are Ensemble methods? Explain the bagging and boosting
ensemble methods.
Ensemble methods are a machine learning technique that combines multiple individual models to create
a stronger, more accurate predictive model. By leveraging the diverse strengths of different models,
ensemble learning aims to mitigate errors, enhance performance, and increase the overall robustness of
predictions, leading to improved results across various tasks in machine learning and data analysis.

Bagging

The steps of bagging are as follows:

1. We have an initial training dataset containing n instances.

2. We create m subsets of data from the training set. Each subset is drawn from the initial dataset
with replacement, which means that a specific data point can be sampled more than once.

3. For each subset of data, we train a corresponding weak learner independently. These models are
homogeneous, meaning they are of the same type.

4. Each model makes a prediction.

5. The predictions are aggregated into a single prediction using either majority voting or averaging.

Boosting



Boosting works with the following steps:

1. We sample m subsets from the initial training dataset.

2. Using the first subset, we train the first weak learner.

3. We test the trained weak learner on the training data. As a result of the testing, some data points
will be incorrectly predicted.

4. Each data point with a wrong prediction is sent into the second subset of data, and this subset is
updated.

5. Using this updated subset, we train and test the second weak learner.

6. We continue with the following subsets until the total number of subsets is reached.

7. We now have the total prediction. The overall prediction has already been aggregated at each
step, so there is no need to calculate it separately.
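A hedged sketch of both ideas using scikit-learn's ready-made ensembles is shown below; it assumes scikit-learn is available and uses a synthetic dataset purely for illustration:

```python
# A sketch of bagging and boosting with scikit-learn's built-in ensemble classes.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many trees trained on bootstrap samples, predictions combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: weak learners trained sequentially, each focusing on the previous errors.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

print("bagging accuracy :", bagging.score(X_test, y_test))
print("boosting accuracy:", boosting.score(X_test, y_test))
```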

Suppose that the data for analysis includes the attribute price (in dollars):
8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34. Use smoothing by bin means to
smooth these data, using a bin depth of 4. Illustrate your steps.

Step 1: Sort the Data. First, sort the data in ascending order: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34

Step 2: Create Bins. Next, create bins with a depth of 4, grouping the data into sets of 4 values each:

• Bin 1: 8, 9, 15, 16

• Bin 2: 21, 21, 24, 26

• Bin 3: 27, 30, 30, 34

Step 3: Calculate Bin Means. Calculate the mean (average) value for each bin:

• Bin 1 mean: (8 + 9 + 15 + 16) / 4 = 12

• Bin 2 mean: (21 + 21 + 24 + 26) / 4 = 23

• Bin 3 mean: (27 + 30 + 30 + 34) / 4 = 30.25

Step 4: Replace Data Points. Replace each data point with the mean value of its bin.

Smoothed data: 12, 12, 12, 12, 23, 23, 23, 23, 30.25, 30.25, 30.25, 30.25
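The same smoothing-by-bin-means procedure can be reproduced in a few lines of plain Python, as sketched below:

```python
# A small sketch that reproduces the smoothing-by-bin-means steps above.
prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
depth = 4

data = sorted(prices)                                             # step 1: sort
bins = [data[i:i + depth] for i in range(0, len(data), depth)]    # step 2: equal-depth bins

smoothed = []
for b in bins:                              # steps 3-4: replace each value by its bin mean
    mean = sum(b) / len(b)
    smoothed.extend([mean] * len(b))

print(bins)       # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(smoothed)   # [12.0, 12.0, 12.0, 12.0, 23.0, ..., 30.25]
```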

Which tool is used to summarize similarity measurements? Discuss the
ways to measure the distance between an object and a cluster.
Tool for Summarizing Similarity Measurements:
One commonly used tool for summarizing similarity measurements in the context of clustering and
distance calculations is a dendrogram. A dendrogram is a tree-like diagram that represents the
hierarchical structure of clusters in a dataset. It helps visualize how data points are grouped together
based on their similarity or distance from each other.



Measuring Distance Between an Object and a Cluster:
There are several methods to measure the distance between an object and a cluster in the context of
clustering algorithms like hierarchical clustering or k-means. These distance metrics help determine how
close or dissimilar a data point is from a cluster of data points. Here are some common ways to measure
this distance:

Euclidean Distance:
Euclidean distance is the most widely used distance metric. It calculates the straight-line distance
between two points in the multi-dimensional space. For a data point and a cluster, you can compute the
Euclidean distance between the data point and the centroid (mean) of the cluster.

Manhattan Distance:
Manhattan distance, also known as L1 distance or city block distance, measures the distance between
two points by summing the absolute differences of their coordinates. It can be used to calculate the
distance between a point and a cluster's centroid.

Cosine Similarity:
Cosine similarity measures the cosine of the angle between two vectors. In the context of clustering, you
can calculate the cosine similarity between a data point and the centroid of a cluster. This is particularly
useful for text data or when you want to measure similarity based on the direction of vectors.

Mahalanobis Distance:
Mahalanobis distance considers the correlations between variables in the dataset. It's a more advanced
metric that takes into account the covariance structure of the data. It is useful when dealing with
multivariate data.

Jaccard Similarity:
Jaccard similarity measures the similarity between two sets by comparing the size of their intersection to
the size of their union. It's often used for clustering binary or categorical data.

Correlation-Based Distance:
For datasets with continuous variables, you can use correlation-based distances like Pearson or
Spearman correlation coefficients to measure the similarity between a data point and a cluster's central
tendency.

Apply the KNN classifier to predict whether a patient is diabetic, given the
features BMI and Age. Assume k = 3. Test example: BMI = 43.6, Age = 40,
Sugar = ?

Formula: Distance = √((43.6 − BMI)^2 + (40 − Age)^2)



BMI Age Sugar Formula Distance

33.6 50 1 √((43.6-33.6)^2+(40-50)^2) 14.14

26.6 30 0 √((43.6-26.6)^2+(40-30)^2) 19.72

23.4 40 0 √((43.6-23.4)^2+(40-40)^2) 20.20

43.1 67 0 √((43.6-43.1)^2+(40-67)^2) 27.00

35.3 23 1 √((43.6-35.3)^2+(40-23)^2) 18.92

35.9 67 1 √((43.6-35.9)^2+(40-67)^2) 28.08

36.7 45 1 √((43.6-36.7)^2+(40-45)^2) 8.52

25.7 46 0 √((43.6-25.7)^2+(40-46)^2) 18.88

23.3 29 0 √((43.6-23.3)^2+(40-29)^2) 23.09

31 56 1 √((43.6-31)^2+(40-56)^2) 20.37

With k = 3, the three nearest neighbours are at distances 8.52 (Sugar = 1), 14.14 (Sugar = 1) and
18.88 (Sugar = 0). The majority class among them is 1, so the test patient (BMI = 43.6, Age = 40) is
predicted to be diabetic (Sugar = 1).
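The sketch below reproduces this k-NN computation in plain Python, using the (BMI, Age, Sugar) rows from the table as the training data:

```python
# A short sketch that reproduces the k-NN prediction above in plain Python.
import math

train = [(33.6, 50, 1), (26.6, 30, 0), (23.4, 40, 0), (43.1, 67, 0),
         (35.3, 23, 1), (35.9, 67, 1), (36.7, 45, 1), (25.7, 46, 0),
         (23.3, 29, 0), (31.0, 56, 1)]
query = (43.6, 40)
k = 3

# Euclidean distance of the query to every training point, then a majority vote.
dists = sorted((math.dist(query, (bmi, age)), sugar) for bmi, age, sugar in train)
neighbours = dists[:k]
prediction = round(sum(sugar for _, sugar in neighbours) / k)

print(neighbours)   # [(8.52..., 1), (14.14..., 1), (18.88..., 0)]
print(prediction)   # 1 -> the patient is predicted to be diabetic
```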

Discuss three applications of data mining for mining web data.

1. Web Content Mining:


Web content mining is the process of extracting valuable information, patterns, and knowledge from the
textual content found on websites. This application of data mining involves techniques such as text
classification, information retrieval, and natural language processing (NLP) to analyze and organize web
data. One practical application of web content mining is sentiment analysis of online product reviews.
Companies can use data mining to automatically analyze customer reviews and determine the sentiment
(positive, negative, or neutral) associated with their products or services. This valuable insight can help
businesses make informed decisions about product improvements or marketing strategies.

2. Web Usage Mining:


Web usage mining focuses on analyzing user interactions with websites, including their browsing
behavior, clicks, and navigation patterns. This application of data mining helps businesses understand
user preferences, improve website usability, and enhance the overall user experience. One example of
web usage mining is personalized recommendation systems used by e-commerce platforms. By analyzing
users' past interactions (such as product views and purchases), these systems can suggest products that
are likely to interest individual users. This not only increases the chances of making a sale but also
enhances user satisfaction.



3. Web Structure Mining:

Web structure mining focuses on analyzing the link structure of the World Wide Web, including
hyperlinks between web pages. This application of data mining helps uncover valuable insights about
website relationships, authority, and connectivity. Search engine optimization (SEO) is a prominent
application of web structure mining. By analyzing the link structure of the web, search engines can
determine the authority and relevance of web pages. Pages with more inbound links from reputable
sources are considered authoritative and are more likely to rank higher in search engine results.

Discuss Hold-out method vs Cross-validation


Cross-validation is usually the preferred method because it gives your model the opportunity to train on
multiple train-test splits. This gives you a better indication of how well your model will perform on
unseen data. Hold-out, on the other hand, depends on just one train-test split, which makes the hold-out
score dependent on how the data happens to be split into train and test sets.

The hold-out method is good to use when you have a very large dataset, you are on a time crunch, or you
are starting to build an initial model in your data science project. Keep in mind that because
cross-validation uses multiple train-test splits, it takes more computational power and time to run than
the hold-out method.
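The contrast can be sketched with scikit-learn as below, assuming scikit-learn is available; the synthetic dataset and logistic-regression model are only for illustration:

```python
# A sketch contrasting hold-out and cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

# Hold-out: a single train/test split, so the score depends on that one split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# Cross-validation: 5 different train/test splits, averaged for a steadier estimate.
cv_scores = cross_val_score(model, X, y, cv=5)

print("hold-out accuracy:", holdout_score)
print("5-fold CV mean   :", cv_scores.mean())
```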

Discuss the two types of Hierarchical clustering.

The two types, agglomerative (bottom-up merging of clusters) and divisive (top-down splitting of the
complete data set), are described above under Hierarchical Cluster Analysis.



Find the string edit distance of the following:

i) edit_distance(Faloutsos, Kollios)

ii) edit_distance(Faloutsos, Gough)

The edit distance between two strings is the minimum number of single-character insertions, deletions,
and substitutions needed to turn one string into the other. It is computed with a dynamic-programming
(Levenshtein) table rather than by comparing the strings position by position.

i) edit_distance(Faloutsos, Kollios) = 6. One optimal alignment deletes "F" and "a" and substitutes
l→K, u→l, t→l and s→i, keeping the remaining "o", "o", "s" unchanged: 2 deletions + 4 substitutions
= 6 edits.

ii) edit_distance(Faloutsos, Gough) = 7. One optimal alignment substitutes F→G, t→g and s→h,
deletes "a", "l", "o" and "s", and keeps "o" and "u": 3 substitutions + 4 deletions = 7 edits.
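A sketch of the standard dynamic-programming computation is shown below; it reproduces the two answers above:

```python
# The standard dynamic-programming (Levenshtein) edit distance: the minimum
# number of insertions, deletions and substitutions to turn string a into b.
def edit_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of a
    for j in range(n + 1):
        d[0][j] = j                               # insert all of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

print(edit_distance("Faloutsos", "Kollios"))  # 6
print(edit_distance("Faloutsos", "Gough"))    # 7
```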



Briefly tell about mean, median, mode, standard deviation, variance, five
number summary using any example.
Mean
Mean is the average of a dataset. To find the mean, add up all the values in the dataset and divide by the
number of values.

Median
Median is the middle value in a dataset when the data is arranged in numerical order. If there are two
middle values, the median is the average of those two values.

Mode
Mode is the most frequent value in a dataset.

Standard deviation
Standard deviation is a measure of how spread out the values in a dataset are. A higher standard
deviation means that the values are more spread out, while a lower standard deviation means that the
values are more concentrated around the mean.

Variance
Variance is the square of the standard deviation. It is another measure of how spread out the values in a
dataset are.

Five number summary


Five number summary is a way to describe the distribution of a dataset using five numbers: the
minimum value, the first quartile (Q1), the median, the third quartile (Q3), and the maximum value.



Example:
Consider the following dataset of heights in inches:

[65, 67, 68, 69, 70, 71, 72, 73, 74, 75]

The mean height is 70.4 inches. The median height is 70.5 inches (the average of the two middle values,
70 and 71). Every value occurs exactly once, so the dataset has no single mode. The population variance
is 9.24 square inches and the population standard deviation is about 3.04 inches. The five number
summary is as follows:

Minimum: 65 inches

Q1: 68 inches (median of the lower half)

Median: 70.5 inches

Q3: 73 inches (median of the upper half)

Maximum: 75 inches
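The figures above can be checked with Python's built-in statistics module, as sketched below:

```python
# A short sketch checking the summary statistics with the statistics module.
import statistics as st

heights = [65, 67, 68, 69, 70, 71, 72, 73, 74, 75]

print(st.mean(heights))       # 70.4
print(st.median(heights))     # 70.5
print(st.pvariance(heights))  # 9.24  (population variance)
print(st.pstdev(heights))     # ~3.04 (population standard deviation)

# Five-number summary: min, Q1 (median of lower half), median, Q3, max.
print(min(heights), st.median(heights[:5]), st.median(heights),
      st.median(heights[5:]), max(heights))   # 65 68 70.5 73 75
```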

Briefly tell about decision tree and naïve bayes classifiers.


Decision tree
Decision trees are a type of machine learning algorithm that can be used for both classification and
regression tasks. They work by building a tree-like structure, where each node in the tree represents a
question about one of the features in the data. The branches of the tree represent the different possible
answers to the question.

To classify a new data point, the decision tree starts at the root node and asks the question associated
with that node. It then follows the branch that corresponds to the answer to the question. This process
continues until the decision tree reaches a leaf node, which contains the predicted class label.

Naïve Bayes classifiers


Naïve Bayes classifiers are another type of machine learning algorithm that can be used for classification
tasks. They are based on Bayes' theorem, a mathematical formula for calculating the probability of one
event given another event. Naïve Bayes classifiers work by calculating the probability of each class label
given the values of the features in the data, and then predict the class label with the highest probability.
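A minimal sketch of both classifiers is shown below, assuming scikit-learn is available and using its bundled Iris dataset purely for illustration:

```python
# A minimal sketch of a decision tree and a naive Bayes classifier on toy data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # one question per node
nb = GaussianNB().fit(X_train, y_train)                           # Bayes' theorem per class

print("decision tree accuracy:", tree.score(X_test, y_test))
print("naive Bayes accuracy  :", nb.score(X_test, y_test))
```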

Define Clustering. Tell at least two situations where we use clustering.
Write its different types. Take any example of type (x, y) and apply the
K-Means clustering algorithm.
Clustering is a machine learning technique that groups data points into clusters based on their similarity.
Clustering is an unsupervised learning technique, which means that it does not require any labeled data.



Clustering is used :
Customer segmentation:
Clustering can be used to segment customers into different groups based on their demographics,
purchase history, and other factors. This information can then be used to target customers with relevant
marketing messages and offers.

Fraud detection:
Clustering can be used to identify fraudulent transactions by grouping together transactions that exhibit
similar patterns.

Image segmentation:
Clustering can be used to segment images into different objects, such as people, cars, and trees. This can
be useful for tasks such as object detection and recognition.

Medical diagnosis:
Clustering can be used to group patients with similar symptoms and medical history. This information
can then be used to diagnose diseases and recommend treatments.

Different types of clustering algorithms


K-means clustering:
K-means clustering is a simple and efficient clustering algorithm that works by dividing the data into a
predefined number of clusters.

Hierarchical clustering:
Hierarchical clustering is a type of clustering algorithm that produces a hierarchy of clusters. The
hierarchy can be used to identify different levels of similarity in the data.

Density-based clustering:
Density-based clustering is a type of clustering algorithm that groups together data points that are close
to each other in density.

Model-based clustering:
Model-based clustering is a type of clustering algorithm that assumes that the data follows a particular
model, such as a Gaussian distribution. The algorithm then groups together data points that are similar
according to the model.

Example of K-means clustering


Consider the following example of a dataset of (x, y) coordinates:

(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)

(10, 12), (11, 13), (12, 14), (13, 15), (14, 16)

(20, 22), (21, 23), (22, 24), (23, 25), (24, 26)
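A sketch of applying k-means to these points (with k = 3, since the data fall into three visibly separated groups) is shown below, assuming scikit-learn is available:

```python
# A sketch applying k-means with k = 3 to the (x, y) points above.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
              (10, 12), (11, 13), (12, 14), (13, 15), (14, 16),
              (20, 22), (21, 23), (22, 24), (23, 25), (24, 26)], dtype=float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # one cluster label (0, 1 or 2) per point
print(kmeans.cluster_centers_)  # roughly (3, 4), (12, 14) and (22, 24)
```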



Briefly discuss the data warehouse and its need in the modern age. Differentiate
between OLTP and OLAP. What can be applied on a data warehouse?
Data warehouse
A data warehouse is a system that stores and manages data from multiple sources for analytical
purposes. It is typically used to store historical data, which can then be analyzed to identify trends and
patterns. Data warehouses are designed to support complex queries and fast response times, even for
very large datasets.

Need for data warehouses in the modern age


In the modern age, businesses are collecting more data than ever before. This data can be used to gain
valuable insights into customer behavior, market trends, and operational performance. However, this
data is often scattered across multiple systems and in different formats. This makes it difficult to analyze
and extract meaningful insights.

Data warehouses solve this problem by providing a central repository for storing and managing all of a
business's data. This makes it easy to run complex queries and generate reports that can be used to
improve decision-making.

Differences between OLTP and OLAP


OLTP (online transaction processing) and OLAP (online analytical processing) are two different types of
data processing systems. OLTP systems are designed to process a large volume of real-time transactions,
such as those generated by e-commerce websites and point-of-sale systems. OLAP systems, on the other
hand, are designed to perform complex analysis on large datasets.

Characteristic: OLTP vs OLAP

Primary purpose: Process transactions / Analyze data
Data type: Real-time, transactional data / Historical, aggregated data
Data source: Single source / Multiple sources
Database schema: Normalized / Denormalized
Query complexity: Simple / Complex
Response time: Fast / Slow

What is applied on a data warehouse?
Data warehouses can be used to apply a variety of analytical techniques, including:

Data mining:
Data mining is the process of identifying patterns and trends in large datasets. Data mining can be used
to identify new market opportunities, optimize pricing strategies, and predict customer behavior.

Machine learning:
Machine learning is a type of artificial intelligence that allows computers to learn without being explicitly
programmed. Machine learning can be used to build predictive models that can be used to improve
business decision-making.

Statistical analysis:
Statistical analysis is the process of collecting, analyzing, and interpreting data. Statistical analysis can be
used to test hypotheses, identify relationships between variables, and make predictions.
