
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES

(UGC AUTONOMOUS)
MCA I Year I Semester Examinations, AUG – 2025
DATA MINING
(COMPUTER SCIENCE AND ENGINEERING)
Time: 3 Hours                                          Max. Marks: 60

PART - A
Answer all the questions:

1.a) Define clustering?


Clustering is the process of grouping a set of objects into clusters such that objects in the
same cluster are more similar to each other than to those in other clusters.

b) What is the difference between similarity and dissimilarity?


Similarity measures how alike two objects are, with higher values indicating more similarity.
Dissimilarity measures how different two objects are, with higher values indicating less
similarity.

c) List any two applications of Association Rules?

1. Market Basket Analysis (finding products often bought together).


2. Cross-selling and up-selling in retail.

d) Define Support and Confidence in Association Rule Mining?

 Support: The proportion of transactions in the dataset that contain a particular itemset.
 Confidence: The likelihood that a transaction containing itemset A also contains itemset B (a conditional probability).

e) What is the purpose of pruning in Decision Tree construction?


Pruning reduces overfitting by removing branches that have little importance or do not
improve accuracy on unseen data, making the model simpler and more generalizable.

f) Write one advantage of Naïve Bayes classifier?


It is simple, fast, and works well with high-dimensional data even with the assumption of
feature independence.

g) Define Data Mining?


Data Mining is the process of discovering useful patterns, knowledge, and insights from large
datasets using techniques from statistics, machine learning, and database systems.

h) List two differences between K-Means and Hierarchical clustering.


 K-Means requires specifying the number of clusters beforehand; Hierarchical does
not.
 K-Means is a partitioning method; Hierarchical creates a tree-like structure
(dendrogram).

i) What is Web Usage Mining?


Web Usage Mining is the process of extracting useful patterns from web user behavior data
like web logs to understand user navigation and improve web services.

j) Define Text Mining?


Text Mining is the process of extracting meaningful information and patterns from
unstructured text data using techniques like natural language processing (NLP) and machine
learning.

PART - B

2. (a) Define Data Mining. Explain its role in Knowledge Discovery in Databases (KDD)?

Definition of Data Mining:


Data Mining is the process of automatically discovering useful, previously unknown, and
potentially actionable patterns and knowledge from large volumes of data using methods
from statistics, machine learning, and database systems.

Role of Data Mining in Knowledge Discovery in Databases (KDD):

 Knowledge Discovery in Databases (KDD) is the overall process of converting raw data into useful knowledge. It includes multiple steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
 Data Mining is a core step within KDD focused specifically on the application of
algorithms to extract patterns from the prepared data.
 While KDD encompasses the entire pipeline of discovering knowledge, Data Mining
is the stage where computational techniques are applied to identify patterns such as
classification, clustering, association rules, and anomalies.
 Thus, Data Mining acts as the "engine" of KDD, turning processed data into
actionable insights, which are then evaluated and interpreted to gain knowledge.

2. (b) Discuss major challenges and tasks involved in Data Mining?

Major Challenges in Data Mining:

1. Handling Large Volumes of Data:


Data mining algorithms must be efficient and scalable to process massive datasets
generated by modern applications.
2. Data Quality and Preprocessing:
Real-world data is often noisy, incomplete, inconsistent, or contains errors. Proper
data cleaning, integration, and transformation are critical before mining.
3. High Dimensionality:
Many datasets have a large number of attributes/features, making it challenging to
find meaningful patterns without dimensionality reduction techniques.
4. Privacy and Security Concerns:
Mining sensitive data raises ethical issues and requires techniques to protect user
privacy and secure data.
5. Complex and Heterogeneous Data Types:
Data can be structured, semi-structured, or unstructured (e.g., text, images), requiring
different mining approaches.
6. Dynamic and Streaming Data:
Data can be continuously generated (e.g., sensor networks, web logs), demanding
real-time or incremental mining methods.
7. Interpretability of Results:
Extracted patterns must be understandable and actionable for users, requiring clear
presentation and visualization.

Major Tasks Involved in Data Mining:

1. Classification:
Assigning data items to predefined categories or classes based on training data.
2. Clustering:
Grouping similar data points into clusters without predefined labels.
3. Association Rule Mining:
Discovering interesting relationships or correlations between items in large datasets
(e.g., market basket analysis).
4. Regression:
Predicting continuous numeric values based on input variables.
5. Anomaly Detection:
Identifying unusual or rare data points that do not conform to expected patterns.
6. Summarization:
Providing a compact description or summary of the dataset.
7. Trend and Pattern Analysis:
Discovering temporal or sequential patterns over time.

OR
3. (a) Explain different types of data preprocessing techniques (data cleaning,
transformation, discretization)?

Data preprocessing is an essential step in the data mining process that involves preparing
raw data to improve its quality and make it suitable for mining. Key preprocessing techniques
include:

1. Data Cleaning
Data cleaning involves detecting and correcting (or removing) errors and inconsistencies in data to improve data quality. It addresses issues such as:

 Handling Missing Values: Filling missing values using methods like mean/median/mode substitution, or using prediction models.
 Noise Removal: Smoothing noisy data with techniques such as binning, regression, or clustering.
 Correction of Inconsistencies: Identifying and fixing contradictory data entries or duplicates.

2. Data Transformation

Data transformation converts data into appropriate forms for mining:

 Normalization: Scaling data to a specific range, e.g., 0 to 1, so that all features contribute equally.
 Aggregation: Summarizing data, e.g., computing averages over time periods.
 Generalization: Replacing detailed data with higher-level concepts, e.g., converting ages into age groups.

3. Data Discretization

Discretization converts continuous attributes into discrete intervals or categories. This simplifies data and is especially useful for algorithms that require categorical input. Methods include:

 Binning: Dividing the range of data into intervals (equal-width or equal-frequency bins).
 Clustering-based Discretization: Grouping data points into clusters and treating each cluster as an interval.
 Entropy-based Methods: Using information gain to find the best split points.
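
The following short Python sketch (an illustrative example, assuming the pandas library and a small made-up age/income table) shows one way cleaning, transformation, and discretization can look in practice:

import pandas as pd

# Small made-up dataset used only for illustration
df = pd.DataFrame({"age": [23, 45, 31, 60, 38],
                   "income": [28000, 52000, None, 75000, 60000]})

# Data cleaning: fill the missing income with the column mean
df["income"] = df["income"].fillna(df["income"].mean())

# Data transformation: min-max normalization of income to the range [0, 1]
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Data discretization: equal-width binning of age into three labelled groups
df["age_group"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

print(df)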

3.(b) Write about similarity and dissimilarity measures with examples?

Similarity and Dissimilarity Measures

In data mining and machine learning, similarity and dissimilarity measures are essential to
compare data objects. They help quantify how alike or how different two objects are. These
measures are foundational in clustering, classification, and pattern recognition.

Similarity Measures

 Definition: Similarity measures indicate the degree to which two objects resemble each other.
 Range: Typically between 0 and 1, where:
o 0 means no similarity,
o 1 means identical objects.
 Usage: Used when you want to find how close or similar two objects are.
 Example – Cosine Similarity: Measures the cosine of the angle between two vectors; its value ranges from -1 to 1:
o 1 means the vectors point in exactly the same direction,
o 0 means they are orthogonal (no similarity),
o -1 means they point in opposite directions.

Dissimilarity Measures

 Definition: Dissimilarity (or distance) measures quantify how different two objects
are.
 Range: Usually non-negative values starting from 0.
o 0 means objects are identical,
o larger values mean more dissimilar.
 Usage: Used to find how far apart two objects are in the feature space.

Common Examples:

1. Euclidean Distance

Measures the straight-line distance between two points in n-dimensional space:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
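
A minimal Python sketch of these two measures (plain Python, no libraries; the sample points are made up for illustration):

import math

def euclidean_distance(p, q):
    # Straight-line distance between two equal-length points
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def cosine_similarity(p, q):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi ** 2 for pi in p))
    norm_q = math.sqrt(sum(qi ** 2 for qi in q))
    return dot / (norm_p * norm_q)

print(euclidean_distance((1, 2), (4, 6)))    # 5.0
print(cosine_similarity((1, 0), (0, 1)))     # 0.0 (orthogonal vectors)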

4. (a) Explain the concept of Association Rules with suitable examples?

Association Rules are a popular data mining technique used to find interesting relationships,
patterns, or correlations among a large set of items in transactional databases.

What are Association Rules?

An association rule is an implication of the form:

A ⇒ B

where A and B are itemsets (sets of items) and A ∩ B = ∅ (meaning A and B do not overlap). This rule means:

"If A occurs in a transaction, then B is likely to occur in the same transaction."

Example:

Consider the following transactions from a supermarket database:

Transaction ID Items Bought


1 Bread, Milk
2 Bread, Diapers, Beer, Eggs
3 Milk, Diapers, Beer, Cola
4 Bread, Milk, Diapers, Beer
5 Bread, Milk, Diapers, Cola

Analyze the rule:

Diapers ⇒ Beer

 Support: Transactions containing both Diapers and Beer = Transactions 2, 3, 4 → 3 out of 5 total, so Support = 3/5 = 60%.

 Confidence: Transactions containing both Diapers and Beer divided by transactions containing Diapers. Diapers appear in Transactions 2, 3, 4, 5 → 4 total, so Confidence = 3/4 = 75%.

Interpretation: 75% of the transactions that contain diapers also contain beer, and the rule applies to 60% of all transactions.
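
A short Python sketch that recomputes these two values for the rule Diapers ⇒ Beer from the five transactions above (plain Python, for illustration only):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

both = sum(1 for t in transactions if {"Diapers", "Beer"} <= t)   # transactions with Diapers and Beer = 3
antecedent = sum(1 for t in transactions if "Diapers" in t)       # transactions with Diapers = 4

support = both / len(transactions)        # 3/5 = 0.60
confidence = both / antecedent            # 3/4 = 0.75
print(support, confidence)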
4. (b) Discuss the measures of rule interestingness: Support, Confidence, and Lift.

In Association Rule Mining, rule interestingness measures evaluate how strong and useful a discovered rule A ⇒ B is. The three most common measures are:

 Support: The fraction of all transactions that contain both A and B, i.e., Support(A ⇒ B) = P(A ∪ B). It indicates how frequently the rule applies to the dataset.
 Confidence: The fraction of transactions containing A that also contain B, i.e., Confidence(A ⇒ B) = P(B | A) = Support(A ∪ B) / Support(A). It measures the reliability of the rule.
 Lift: The ratio Confidence(A ⇒ B) / Support(B). Lift > 1 indicates A and B occur together more often than expected by chance (positive correlation), Lift = 1 indicates independence, and Lift < 1 indicates negative correlation.
OR
5(a) Describe the working of the Apriori Algorithm with an example.

The Apriori algorithm is a fundamental algorithm in Association Rule Mining used to identify frequent itemsets in a transactional database and to generate association rules from them.

It is based on the Apriori property:

“All non-empty subsets of a frequent itemset must also be frequent.”

This principle helps prune the search space efficiently by eliminating itemsets that have
infrequent subsets.

Working Steps of the Apriori Algorithm:

1. Set a minimum support threshold to determine which itemsets are considered


"frequent."
2. Find frequent 1-itemsets (L1):
Count how many transactions each individual item appears in and keep those that
meet the support threshold.
3. Generate candidate 2-itemsets (C2) from L1 by joining frequent 1-itemsets.
4. Count support for each candidate in C2 and retain those meeting the support
threshold (forming L2).
5. Repeat:
Generate candidate k-itemsets (Ck) from frequent (k−1)-itemsets (Lk−1), count
support, and prune infrequent ones.
6. Stop when no more frequent itemsets can be generated.
7. Generate association rules from frequent itemsets that satisfy a minimum
confidence threshold.

Example:

Transaction Database (5 transactions):

TID Items
1 {Milk, Bread}
2 {Milk, Diapers, Beer, Bread}
3 {Milk, Diapers, Beer, Cola}
4 {Diapers, Beer, Bread}
5 {Milk, Bread, Diapers, Cola}

Minimum Support Threshold: 60% → i.e., at least 3 transactions

Step-by-Step Execution:

Step 1: Find Frequent 1-itemsets (L1)


Item Support Count Support (%) Frequent?
Milk 4 80% Yes
Bread 4 80% Yes
Diapers 4 80% Yes
Beer 3 60% Yes
Cola 2 40% No

L1 = {Milk, Bread, Diapers, Beer}

Step 2: Generate Candidate 2-itemsets (C2)

Candidates:

 {Milk, Bread}
 {Milk, Diapers}
 {Milk, Beer}
 {Bread, Diapers}
 {Bread, Beer}
 {Diapers, Beer}

Step 3: Count Support for C2

Itemset Support Count Support (%) Frequent?


{Milk, Bread} 3 60% Yes
{Milk, Diapers} 3 60% Yes
{Milk, Beer} 2 40% No
{Bread, Diapers} 3 60% Yes
{Bread, Beer} 2 40% No
{Diapers, Beer} 3 60% Yes

L2 = { {Milk, Bread}, {Milk, Diapers}, {Bread, Diapers}, {Diapers, Beer} }

Step 4: Generate Candidate 3-itemsets (C3)

Using L2, generate:

 {Milk, Bread, Diapers}


 {Milk, Diapers, Beer}
 {Bread, Diapers, Beer}

Step 5: Count Support for C3


Itemset Support Count Support (%) Frequent?
{Milk, Bread, Diapers} 2 40% No
{Milk, Diapers, Beer} 2 40% No
{Bread, Diapers, Beer} 2 40% No

❌ No frequent 3-itemsets.

Final Frequent Itemsets:

 L1: {Milk}, {Bread}, {Diapers}, {Beer}


 L2: {Milk, Bread}, {Milk, Diapers}, {Bread, Diapers}, {Diapers, Beer}
 L3: None
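
A compact Python sketch of the algorithm, run on the same five transactions with min_support = 0.6; this is one possible illustrative implementation (the subset-pruning step is omitted for brevity), not the only way to code Apriori:

from itertools import combinations

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diapers", "Beer", "Bread"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Diapers", "Beer", "Bread"},
    {"Milk", "Bread", "Diapers", "Cola"},
]
min_support = 0.6
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / n

items = sorted({item for t in transactions for item in t})
frequent = {}
current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]  # L1
k = 1
while current:
    frequent.update({s: support(s) for s in current})
    # Join step: build (k+1)-item candidates from the frequent k-itemsets, keep the frequent ones
    candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k + 1}
    current = [c for c in candidates if support(c) >= min_support]
    k += 1

for itemset, sup in sorted(frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), round(sup, 2))   # reproduces L1 and L2 above; no 3-itemset survives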

5(b) Compare Apriori with FP-Growth Algorithm?

A comparison between the Apriori and FP-Growth algorithms, two popular methods for frequent itemset mining in association rule learning:

 Basic Idea: Apriori generates candidate itemsets and tests their frequency in the database iteratively; FP-Growth builds a compact data structure called the FP-tree and mines frequent patterns without candidate generation.
 Candidate Generation: Apriori explicitly generates candidate itemsets and prunes them based on minimum support; FP-Growth does not generate candidates explicitly and uses a divide-and-conquer strategy on the FP-tree.
 Data Structure Used: Apriori uses simple database scans and a list of candidate itemsets; FP-Growth uses the FP-tree, a prefix-tree structure storing compressed information about frequent patterns.
 Number of Database Scans: Apriori performs multiple scans, one for each level of itemsets (k-itemsets); FP-Growth typically needs only 2 scans, one to build the FP-tree and one to mine patterns.
 Performance: Apriori can be slow due to large candidate sets and repeated database scans; FP-Growth is faster and more efficient, especially on large datasets with many frequent itemsets.
 Memory Usage: Apriori is less memory-efficient because candidate sets are stored explicitly; FP-Growth is more memory-efficient by compressing the data in the FP-tree.
 Complexity: Apriori's complexity grows exponentially with the number of items because of candidate generation; FP-Growth usually scales better by avoiding candidate generation.
 Suitability: Apriori works well for smaller datasets or datasets with small frequent itemsets; FP-Growth is suitable for large datasets and dense data with many frequent patterns.

6(a) Explain the process of building a decision tree with suitable dataset?
A Decision Tree is a tree-like model used for classification and regression tasks in machine
learning.
It represents decisions and their possible consequences in a flowchart-like structure.

 Internal nodes represent tests on attributes.


 Branches represent the outcomes of the tests.
 Leaf nodes represent the final decision or class label.

Steps to Build a Decision Tree:

1. Select the Best Attribute to Split:


o Use criteria like Information Gain (entropy-based) or Gini Index to choose
the attribute that best splits the data.
2. Split the Dataset:
o Divide the dataset based on the values of the selected attribute.
3. Repeat Recursively:
o For each subset, repeat the process until:
 All instances belong to the same class, or
 No attributes are left, or
 A stopping criterion (like max depth) is met.
4. Assign Class Labels:
o When no further splitting is possible, assign the most common class in that
subset to the leaf node.
Example Dataset – Play Tennis

Outlook Temperature Humidity Wind Play Tennis


Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No

Building the Tree – Key Steps

1. Calculate Entropy of the Target Attribute (Play Tennis):

 Total = 14 instances, Yes = 9, No = 5
 Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940

2. Compute the Information Gain of each attribute and choose the one with the highest gain as the root. For this dataset, Outlook has the highest gain (≈ 0.247), so it becomes the root node.

3. Repeat recursively on each branch until the leaves are pure.

Resulting Decision Tree (Simplified):

 Outlook = Overcast → Yes
 Outlook = Sunny → check Humidity (High → No, Normal → Yes)
 Outlook = Rain → check Wind (Weak → Yes, Strong → No)
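
A Python sketch of the entropy and information-gain calculation on the Play Tennis table above (illustrative only; the attribute with the highest gain is chosen as the root):

from collections import Counter
from math import log2

data = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(rows):
    # Entropy of the class label (last column) for the given rows
    counts = Counter(r[-1] for r in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, i):
    # Entropy reduction achieved by splitting the rows on attribute i
    remainder = 0.0
    for value in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

print(round(entropy(data), 3))                     # 0.940
for i, name in enumerate(attributes):
    print(name, round(info_gain(data, i), 3))      # Outlook has the highest gain (about 0.247)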

OR

7(a) Describe Naïve Bayes Classifier with an Example

Naïve Bayes is a probabilistic classification algorithm based on Bayes’ Theorem, with a strong assumption of feature independence given the class.

Despite the "naïve" assumption, it works surprisingly well in practice, especially in text
classification, such as spam detection, sentiment analysis, etc.
Example: Spam Email Classification

📄 Dataset:

Email ID Win Offer Free Class


1 Yes Yes No Spam
2 No Yes Yes Spam
3 Yes No No Not Spam
4 No No Yes Not Spam
5 Yes Yes Yes Spam
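
The example can be completed by classifying a new (hypothetical) email with Win = Yes, Offer = Yes, Free = Yes. The sketch below does the counting in plain Python, without Laplace smoothing, purely for illustration:

emails = [
    ({"Win": "Yes", "Offer": "Yes", "Free": "No"},  "Spam"),
    ({"Win": "No",  "Offer": "Yes", "Free": "Yes"}, "Spam"),
    ({"Win": "Yes", "Offer": "No",  "Free": "No"},  "Not Spam"),
    ({"Win": "No",  "Offer": "No",  "Free": "Yes"}, "Not Spam"),
    ({"Win": "Yes", "Offer": "Yes", "Free": "Yes"}, "Spam"),
]
new_email = {"Win": "Yes", "Offer": "Yes", "Free": "Yes"}   # hypothetical test email

scores = {}
for c in ("Spam", "Not Spam"):
    rows = [features for features, label in emails if label == c]
    score = len(rows) / len(emails)                 # prior P(class)
    for attr, value in new_email.items():
        # multiply by P(attribute = value | class), estimated by counting
        score *= sum(1 for f in rows if f[attr] == value) / len(rows)
    scores[c] = score

print(scores)                           # Spam ≈ 0.267, Not Spam = 0.0
print(max(scores, key=scores.get))      # Spam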
(b) Explain the working of K-Nearest Neighbor classification?
What is K-NN?

K-Nearest Neighbor (K-NN) is a supervised learning algorithm used for both classification
and regression. It is an instance-based (non-generalizing) and lazy learning algorithm — it
stores all training data and delays computation until prediction time.

How K-NN Works (Step-by-Step):

1. Choose the Value of K


 Decide how many neighbors to consider (e.g., K = 3 or 5).
 A small K is sensitive to noise; a large K may smooth out the decision boundary.

2. Calculate Distances

 Compute the distance between the test point and all training points.
 Common distance metrics:
o Euclidean distance (straight-line distance)
o Manhattan distance (sum of absolute differences)
o Minkowski distance (a generalization of both)

3. Find the K Nearest Neighbors

 Select the K training points with the smallest distances to the test point.

4. Vote for the Class

 Each of the K neighbors "votes" for their class.


 The majority class among them is assigned to the test point.

5. Output the Predicted Class

 Return the class with the most votes.

Example: Classify a New Point Using K-NN

Training Data:

Point x y Class
A 1 2 Red
B 2 3 Red
C 3 3 Blue
D 6 5 Blue
E 7 8 Blue
❓ Test Point: (3, 4)
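
A short Python sketch that finishes the example for K = 3 (plain Python; math.dist computes the Euclidean distance):

import math
from collections import Counter

training = [((1, 2), "Red"), ((2, 3), "Red"), ((3, 3), "Blue"),
            ((6, 5), "Blue"), ((7, 8), "Blue")]
test_point = (3, 4)
k = 3

# Distance from the test point to every training point, smallest first
neighbours = sorted((math.dist(p, test_point), label) for p, label in training)[:k]
votes = Counter(label for _, label in neighbours)

print(neighbours)                     # C (1.00, Blue), B (1.41, Red), A (2.83, Red)
print(votes.most_common(1)[0][0])     # Red – the majority among the 3 nearest neighbours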

8(a) Define clustering. Differentiate between partitioning and hierarchical clustering?

Clustering is an unsupervised machine learning technique used to group a set of data points into clusters such that:

 Data points within a cluster are similar to each other.


 Data points in different clusters are dissimilar.

Clustering helps in pattern discovery, data summarization, customer segmentation, etc.

Key Idea:

"Group similar data points together without using labeled data."

Two main types are:

 Partitioning Clustering: Divides data into a fixed number of non-overlapping clusters.
 Hierarchical Clustering: Builds a tree-like structure (dendrogram) of clusters by progressively merging or splitting them.

Difference: Partitioning vs Hierarchical Clustering

 Definition: Partitioning clustering divides data into K non-overlapping clusters; hierarchical clustering builds a tree-like structure (dendrogram).
 Input Required: Partitioning requires a pre-defined number of clusters K; hierarchical clustering does not need the number of clusters to be specified initially.
 Algorithm Examples: K-Means and K-Medoids (partitioning); Agglomerative and Divisive (hierarchical).
 Approach: Partitioning gives a flat clustering with one level; hierarchical clustering is nested, with multiple levels.
 Merging/Splitting: In partitioning, all clusters are formed at once; in hierarchical clustering, clusters are formed by merging (bottom-up) or splitting (top-down).
 Computational Complexity: Partitioning is typically faster, especially K-Means; hierarchical clustering is slower, especially with large datasets.
 Reversibility: Once partitioned, the assignment cannot be revisited within the run; a hierarchical dendrogram can be cut at different levels.
 Visualization: Partitioning results are harder to visualize as relationships; hierarchical clustering produces a dendrogram for easy visualization.

(b) Explain the K-Means algorithm with a worked-out example?

K-Means is a partitioning clustering algorithm that divides n data points into K clusters in such a way that each data point belongs to the cluster with the nearest mean (called the centroid).

Steps of the K-Means Algorithm:

1. Choose the number of clusters (K).


2. Initialize K centroids randomly (or using heuristics like K-Means++).
3. Assign each data point to the nearest centroid (based on Euclidean distance).
4. Update centroids by computing the mean of all data points in a cluster.
5. Repeat steps 3–4 until:
o No change in cluster assignments, or
o Centroids no longer move significantly, or
o Max iterations reached.

Worked-Out Example:

Let’s cluster the following 2D points into K = 2 clusters:

Point X Y
A 1 1
B 2 1
C 4 3
D 5 4

Step 1: Choose K = 2

We want 2 clusters.

Step 2: Initialize Centroids

Assume initial centroids:

 Centroid 1 (C1): A (1, 1)


 Centroid 2 (C2): C (4, 3)

Step 3: Assign Each Point to the Nearest Centroid

Point Distance to C1 (1,1) Distance to C2 (4,3) Cluster
A (1,1) 0.00 3.61 C1
B (2,1) 1.00 2.83 C1
C (4,3) 3.61 0.00 C2
D (5,4) 5.00 1.41 C2

Step 4: Update Centroids

New centroid for C1 (mean of A, B) = ((1+2)/2, (1+1)/2) = (1.5, 1)
New centroid for C2 (mean of C, D) = ((4+5)/2, (3+4)/2) = (4.5, 3.5)

Step 5: Reassign Points to New Centroids

Point Distance to C1 (1.5,1) Distance to C2 (4.5,3.5) Cluster

A 0.50 4.30 C1
B 0.50 3.54 C1
C 3.20 0.71 C2
D 4.61 0.71 C2

No change in cluster assignments → Converged


Final Clusters:

 Cluster 1 (C1): A (1,1), B (2,1)


 Cluster 2 (C2): C (4,3), D (5,4)
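
The same example can be reproduced with scikit-learn, assuming that library is available, by starting from the same initial centroids (an illustrative sketch, not the only possible implementation):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])      # points A, B, C, D
init_centroids = np.array([[1, 1], [4, 3]])         # initial centroids: A and C

kmeans = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(kmeans.labels_)            # [0 0 1 1] -> {A, B} and {C, D}
print(kmeans.cluster_centers_)   # [[1.5, 1.0], [4.5, 3.5]]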

OR

9(a) Explain Agglomerative and Divisive Hierarchical Clustering methods?

Agglomerative Hierarchical Clustering (Bottom-Up Approach)


How it works:

 Start with each data point as its own cluster (i.e., if there are 10 data points, start
with 10 clusters).
 Iteratively merge the two closest clusters based on a distance metric (e.g., Euclidean
distance).
 Continue merging until all points are in a single cluster or until a stopping condition
is met.

Steps:

1. Compute the distance matrix between all data points.


2. Merge the two closest clusters.
3. Update the distance matrix.
4. Repeat steps 2–3 until only one cluster remains.

Distance Metrics / Linkage Criteria:

 Single linkage: Minimum distance between points in the two clusters.


 Complete linkage: Maximum distance between points.
 Average linkage: Average distance between points.
 Ward’s method: Minimizes the total within-cluster variance.

Pros:

 Simple and intuitive.


 No need to specify the number of clusters in advance.

Cons:

 Computationally expensive (especially for large datasets).


 Not easily reversible (once clusters are merged, they can't be split).
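
A minimal sketch of agglomerative clustering using SciPy (assuming scipy is available; the four sample points are made up and the tree is cut at two clusters):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])        # illustrative 2-D points

Z = linkage(points, method="single", metric="euclidean")   # bottom-up merge history (the dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")            # cut the tree into 2 flat clusters
print(labels)                                              # e.g. [1 1 2 2]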

Divisive Hierarchical Clustering (Top-Down Approach)


How it works:
 Start with all data points in a single cluster.
 Recursively split the clusters into smaller clusters.
 Continue splitting until each data point is in its own cluster, or a stopping condition is
met.

Steps:

1. Begin with all points in one cluster.


2. Use a clustering algorithm (e.g., k-means or spectral clustering) to split the cluster
into two.
3. Choose a cluster to split based on some criterion (e.g., largest diameter or variance).
4. Repeat until desired number of clusters is reached or each point is its own cluster.

Pros:

 Can be more accurate in some cases.


 Focuses on identifying large, meaningful clusters first.

Cons:

 More computationally intensive than agglomerative.


 Less commonly used due to complexity in choosing the best way to split.

(b) Discuss key issues in clustering and outlier detection?

Both clustering and outlier detection are unsupervised learning tasks in data mining. While
clustering aims to group similar data, outlier detection focuses on identifying data points that
are significantly different from others. However, both face several challenges or key issues,
which are discussed below.

Key Issues in Clustering

1. Choosing the Right Number of Clusters

o Most clustering algorithms (like k-means) require you to specify the number of
clusters (k) in advance.
o It’s often unclear how many natural groupings exist in the data.
o Solutions: Use methods like elbow method, silhouette score, or gap statistic.
2. High Dimensionality

o As the number of features increases, distance metrics (like Euclidean) become less
meaningful.
o Known as the curse of dimensionality.
o Solution: Use dimensionality reduction techniques like PCA, t-SNE, or feature
selection.
3. Scalability

o Clustering large datasets (millions of points) is computationally expensive.


o Time and space complexity become major concerns.
o Solution: Use efficient or approximate algorithms (e.g., mini-batch k-means).
4. Cluster Shape and Size

o Many algorithms (e.g., k-means) assume spherical or convex clusters of similar


size.
o Real-world data may have irregular or nested clusters.
o Solution: Use density-based methods like DBSCAN that detect arbitrary shapes.
5. Noise and Outliers

o Outliers can distort cluster centers and boundaries.


o Especially problematic in algorithms that rely on mean or centroid (e.g., k-means).
o Solution: Use robust algorithms or integrate outlier detection.
6. Initial Parameters

o Algorithms like k-means are sensitive to the initial choice of centroids.


o Bad initialization can lead to poor clustering.
o Solution: Use k-means++ or multiple restarts.
7. Evaluation of Clustering

o No ground truth in unsupervised learning.


o Hard to evaluate clustering objectively.
o Solution: Use internal metrics (e.g., silhouette score) or external metrics (if
labels exist).

10(a) Explain different types of Web Mining: web content mining, web
structure mining, and web usage mining?

Web mining is the process of discovering useful patterns, knowledge, and insights from the
web and its components. It is a subset of data mining and focuses on analyzing web data.

There are three main types of web mining:

Web Content Mining

Web Content Mining refers to the extraction of useful information or knowledge from the
content of web pages.

Content Includes:

 Text (HTML, PDFs, Word files)


 Images
 Audio & Video
 Structured data (e.g., tables, lists)
 Semi-structured data (e.g., XML, JSON)

Techniques Used:
 Text mining / Natural Language Processing (NLP)
 Information retrieval
 Classification and clustering
 Topic modeling (e.g., LDA)

Applications:

 Search engines (e.g., ranking and indexing content)


 Sentiment analysis of product reviews
 News and blog categorization
 Content recommendation systems

Web Structure Mining

Web Structure Mining is the process of analyzing the link structure of the web to discover
relationships between web pages.

 Hyperlinks (inter-document structure)


 Document structure (intra-document structure like DOM tree)

Techniques Used:

 Graph theory
 Social network analysis
 Algorithms like PageRank, HITS (Hyperlink-Induced Topic Search)

Applications:

 Ranking web pages (e.g., Google PageRank)


 Identifying authority and hub pages
 Detecting communities or clusters of related sites
 Web crawling and indexing strategies
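
A small illustrative sketch of web structure mining with the networkx library (the page names and links are made up):

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("home", "products"), ("home", "blog"),
    ("blog", "products"), ("blog", "home"), ("products", "home"),
])

ranks = nx.pagerank(G, alpha=0.85)     # damping factor 0.85 is the usual default
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))       # pages with more incoming links rank higher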

Web Usage Mining

Web Usage Mining is the process of analyzing user behavior based on web log data.

Data Sources:

 Web server logs


 Browser logs
 Cookies
 User profiles
 Clickstream data
 Google Analytics

Techniques Used:

 Sequence mining
 Clustering and classification
 Association rule mining
 Path analysis

Applications:

 Personalized recommendations (e.g., Amazon, Netflix)


 Web traffic analysis
 Website optimization
 Targeted advertising
 Fraud detection

b) Write applications of Web Mining in e-commerce and search engines?

Web mining plays a critical role in enhancing the functionality, personalization, and
performance of e-commerce platforms and search engines. Below are key applications in
each domain:

Applications in E-Commerce

Web mining helps e-commerce businesses understand customer behavior, optimize recommendations, and boost sales.

a) Personalized Product Recommendations

 Technique Used: Web usage mining


 How it works: Analyzes user behavior, purchase history, and clickstream data to
suggest products tailored to individual preferences.
 Example: Amazon recommending products based on browsing and past purchases.

b) Customer Segmentation

 Technique Used: Web usage & content mining


 How it works: Clustering customers into segments (e.g., high spenders, deal seekers)
for targeted marketing.
 Example: Sending different promotions to different customer groups.

c) Dynamic Pricing and Promotions

 Technique Used: Web usage mining + predictive analytics


 How it works: Adjusts product prices or promotions based on demand trends, user
interest, or competitor pricing.
 Example: Real-time discount suggestions based on product popularity.

Applications in Search Engines

Search engines rely heavily on web mining for indexing, ranking, and understanding user
intent.
a) Page Ranking

 Technique Used: Web structure mining


 How it works: Uses algorithms like PageRank and HITS to rank pages based on
link popularity and structure.
 Example: Google ranks authoritative pages higher in search results.

b) Query Suggestion and Auto-Completion

 Technique Used: Web usage mining


 How it works: Analyzes previous queries and user behavior to predict and suggest
next words or phrases.
 Example: Google suggesting "best phones under ₹30000" after typing "best phones
under".

c) Search Result Personalization

 Technique Used: Web usage mining + user profiling


 How it works: Tailors search results based on a user's location, history, preferences,
and device.
 Example: Showing different results for the same query based on user location.

OR

11(a) Define text mining. Explain various techniques for text clustering and
text categorization?

Text Mining (also called text data mining or text analytics) is the process of extracting useful
information and knowledge from unstructured text data. It involves transforming text into
structured data to uncover patterns, trends, or insights that can support decision-making.

Unlike traditional data mining, which works on structured data, text mining deals with natural
language texts like emails, social media posts, reports, articles, etc. It applies techniques from
linguistics, machine learning, and statistics to analyze and interpret textual content.

Techniques for Text Clustering and Text Categorization

1. Text Clustering

Text Clustering is an unsupervised learning technique that groups a set of text documents
into clusters such that documents within the same cluster are more similar to each other than
to those in other clusters. It’s useful for exploring large text corpora and discovering inherent
groupings without predefined labels.

Common techniques:
 K-means Clustering:
It partitions documents into k clusters by minimizing the variance within each cluster.
The documents are typically represented as vectors using methods like TF-IDF or
word embeddings.
Pros: Simple, efficient for large datasets
Cons: Requires specifying number of clusters k in advance
 Hierarchical Clustering:
Builds a tree (dendrogram) of clusters by either agglomerative (bottom-up) or divisive
(top-down) approaches. It does not require specifying the number of clusters upfront.
Pros: Easy to visualize, flexible
Cons: Computationally expensive for large datasets
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Clusters documents based on density of points in space, can find arbitrarily shaped
clusters, and identify noise/outliers.
Pros: Does not require specifying number of clusters, handles noise
Cons: Parameters can be hard to tune
 Self-Organizing Maps (SOM):
Neural network-based method that maps high-dimensional data to lower dimensions
preserving topology.
Pros: Good for visualization, capturing complex patterns
Cons: Requires expertise, computationally intensive
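
An illustrative scikit-learn sketch of text clustering: a few made-up documents are turned into TF-IDF vectors and grouped with K-means (assuming scikit-learn is installed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the match ended with a late goal",
    "the striker scored a stunning goal",
    "the new phone has a faster processor",
    "the laptop ships with more memory and a faster processor",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # documents -> TF-IDF vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)     # e.g. [0 0 1 1]: the sports documents vs. the hardware documents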

2. Text Categorization (Text Classification)

Text Categorization is a supervised learning technique where predefined categories or labels are assigned to text documents. The system learns from labeled examples and then predicts categories for new texts.

Common techniques:

 Naive Bayes Classifier:


A probabilistic classifier based on Bayes’ theorem assuming feature independence.
Works well for spam filtering, sentiment analysis, etc.
Pros: Fast, simple, effective on many text tasks
Cons: Strong independence assumption may reduce accuracy
 Support Vector Machines (SVM):
Finds the hyperplane that best separates the classes in the feature space. Widely used
due to good performance on text data.
Pros: High accuracy, effective with high-dimensional data
Cons: Can be slow on very large datasets
 Decision Trees and Random Forests:
Tree-based models that split data based on feature values. Random Forests use
ensembles of trees to improve robustness.
Pros: Interpretable, handles nonlinear data
Cons: May overfit if not tuned properly
 Neural Networks and Deep Learning (e.g., CNNs, RNNs, Transformers):
Models that learn hierarchical representations and capture complex relationships in
text, especially effective with large datasets.
Pros: State-of-the-art accuracy, can capture context and semantics
Cons: Requires large labeled data, computationally expensive
 Logistic Regression:
A linear model used for binary or multi-class classification. Often used as a baseline
method.
Pros: Simple, interpretable
Cons: Limited in capturing complex patterns
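
An illustrative scikit-learn sketch of text categorization with a Naive Bayes classifier (the tiny labelled examples are made up; a real system needs far more training data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "win a free prize now", "limited offer click to claim",
    "meeting moved to monday", "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())   # vectorize, then classify
model.fit(train_texts, train_labels)
print(model.predict(["claim your free offer"]))             # expected: ['spam']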

(b) Discuss the role of episode rule discovery in unstructured text mining?

Episode Rule Discovery is a data mining technique primarily used to discover patterns of
events or occurrences over time in a sequence. When applied to unstructured text mining,
episode rule discovery helps uncover meaningful temporal or sequential relationships
between concepts, phrases, or events mentioned in text data.

 Episode: A sequence or pattern of events occurring in a specific order or with


temporal constraints. For example, in text, an episode could be the co-occurrence or
sequence of keywords or topics that tend to appear together or in a particular order.
 Episode Rule Discovery involves finding rules like:
“If event A occurs, then event B follows within a certain time or document window.”
For example, in a medical report, "fever" followed by "rash" might form an episode
indicating a particular diagnosis.

Role in Unstructured Text Mining

1. Discovering Sequential Patterns:


Text data often contains sequences of concepts, events, or actions. Episode rule
discovery helps identify these sequences, revealing how one event or term leads to
another in the text narrative.
2. Temporal Relationships:
In documents such as news reports, logs, or medical records, the order and timing of
events matter. Episode rules capture these temporal dependencies, which pure
frequency-based approaches might miss.
3. Improving Text Understanding:
By recognizing episodes, the mining system gains deeper insight into the structure
and semantics of text, allowing for better summarization, trend detection, or
prediction.
4. Event Correlation:
Helps in correlating multiple events within text, which can be critical for domains like
fraud detection, medical diagnosis, or social media analysis.
5. Filtering Noise:
Unstructured text is noisy and vast. Episode rule discovery can filter out irrelevant
data by focusing on significant event sequences that hold meaningful patterns.

Example Use Cases

 Medical Text Mining:


Detecting patterns like symptom progression or treatment sequences in clinical notes.
 Customer Feedback Analysis:
Finding sequences of complaints or product issues over time.
 Security and Fraud Detection:
Discovering suspicious sequences of actions described in logs or emails.
 Narrative and Story Analysis:
Understanding plot structures or event sequences in literature or news articles.

THE END
