Data Mining Assignment

The document outlines methods for calculating dissimilarity matrices using nominal, ordinal, numeric, and mixed attributes from a dataset. It also discusses statistical analysis of age data, including midrange, quartiles, and boxplots, as well as distance calculations between data tuples. Additionally, it covers data cleaning and feature representation techniques using Boolean and TF methods for text data.
1. Using the following data (recreated in the script below), find the dissimilarity matrices using:

a. Nominal attributes

b. Ordinal attributes

c. Numeric attributes

d. All attribute types (mixed type)

Solution:

import pandas as pd
import numpy as np
from scipy.spatial import distance

# Create the dataframe from the given table
data = {
    'Object Identifier': [1, 2, 3, 4],
    'test-1 (nominal)': ['code A', 'code B', 'code C', 'code A'],
    'test-2 (ordinal)': ['excellent', 'fair', 'good', 'excellent'],
    'test-3 (numeric)': [45, 22, 64, 28]
}
df = pd.DataFrame(data)

# Define the rank order for the ordinal attribute
ordinal_order = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}

# a. Dissimilarity matrix for nominal attributes (test-1):
#    d(i, j) = 0 if the two values match, 1 otherwise
def nominal_dissimilarity(df, column):
    n = len(df)
    dissimilarity = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dissimilarity[i, j] = 0 if df[column][i] == df[column][j] else 1
    return dissimilarity

nominal_dissim = nominal_dissimilarity(df, 'test-1 (nominal)')
print("a. Dissimilarity matrix for nominal attributes (test-1):")
print(nominal_dissim)
print("\n")

# b. Dissimilarity matrix for ordinal attributes (test-2):
#    map each value to its rank, then use the normalized rank difference
def ordinal_dissimilarity(df, column, order_mapping):
    n = len(df)
    dissimilarity = np.zeros((n, n))
    # Convert ordinal values to numeric ranks
    numeric_values = df[column].map(order_mapping)
    max_val = max(order_mapping.values())
    min_val = min(order_mapping.values())
    for i in range(n):
        for j in range(n):
            dissimilarity[i, j] = abs(numeric_values[i] - numeric_values[j]) / (max_val - min_val)
    return dissimilarity

ordinal_dissim = ordinal_dissimilarity(df, 'test-2 (ordinal)', ordinal_order)
print("b. Dissimilarity matrix for ordinal attributes (test-2):")
print(ordinal_dissim)
print("\n")

# c. Dissimilarity matrix for numeric attributes (test-3):
#    pairwise Euclidean distance normalized by the maximum distance
def numeric_dissimilarity(df, column):
    values = df[column].values.reshape(-1, 1)
    dissimilarity = distance.squareform(distance.pdist(values, 'euclidean'))
    # Normalize by the maximum pairwise distance
    max_dist = np.max(dissimilarity)
    if max_dist > 0:
        dissimilarity = dissimilarity / max_dist
    return dissimilarity

numeric_dissim = numeric_dissimilarity(df, 'test-3 (numeric)')
print("c. Dissimilarity matrix for numeric attributes (test-3):")
print(numeric_dissim)
print("\n")

# d. Dissimilarity matrix for mixed attribute types:
#    weighted sum of the per-type matrices (equal weights here)
def mixed_dissimilarity(df, nominal_cols, ordinal_cols, numeric_cols, ordinal_order):
    n = len(df)
    total_dissim = np.zeros((n, n))
    weights = {'nominal': 1/3, 'ordinal': 1/3, 'numeric': 1/3}  # Equal weights

    # Nominal contribution
    for col in nominal_cols:
        dissim = nominal_dissimilarity(df, col)
        total_dissim += weights['nominal'] * dissim

    # Ordinal contribution
    for col in ordinal_cols:
        dissim = ordinal_dissimilarity(df, col, ordinal_order)
        total_dissim += weights['ordinal'] * dissim

    # Numeric contribution
    for col in numeric_cols:
        dissim = numeric_dissimilarity(df, col)
        total_dissim += weights['numeric'] * dissim

    return total_dissim

mixed_dissim = mixed_dissimilarity(
    df,
    nominal_cols=['test-1 (nominal)'],
    ordinal_cols=['test-2 (ordinal)'],
    numeric_cols=['test-3 (numeric)'],
    ordinal_order=ordinal_order
)
print("d. Dissimilarity matrix for mixed attributes:")
print(mixed_dissim)
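For the four objects embedded in the data dictionary, the script should print approximately the following lower-triangle values (each matrix is symmetric with a zero diagonal):

a. Nominal: d(2,1) = 1, d(3,1) = 1, d(3,2) = 1, d(4,1) = 0, d(4,2) = 1, d(4,3) = 1

b. Ordinal: d(2,1) ≈ 0.667, d(3,1) ≈ 0.333, d(3,2) ≈ 0.333, d(4,1) = 0, d(4,2) ≈ 0.667, d(4,3) ≈ 0.333

c. Numeric (normalized by the largest gap, 42): d(2,1) ≈ 0.548, d(3,1) ≈ 0.452, d(3,2) = 1, d(4,1) ≈ 0.405, d(4,2) ≈ 0.143, d(4,3) ≈ 0.857

d. Mixed (equal-weight average of a-c): d(2,1) ≈ 0.738, d(3,1) ≈ 0.595, d(3,2) ≈ 0.778, d(4,1) ≈ 0.135, d(4,2) ≈ 0.603, d(4,3) ≈ 0.730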

2. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are
(in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70.

a. What is the midrange of the data?

b. Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

c. Give the five-number summary of the data.

d. Show a boxplot of the data.

e. How is a quantile–quantile plot different from a quantile plot?


Solution: Analysis of Age Data

• Midrange: The midrange is the average of the minimum and maximum values in the dataset.

o Minimum value: 13

o Maximum value: 70

o Midrange = (13 + 70) / 2 = 41.5

• Quartiles (Q1 and Q3):

o There are 27 data points in total.

o First Quartile (Q1): The first quartile is roughly the value at position (27 + 1) × 0.25 = 7. The 7th value in the ordered list is 20, so Q1 ≈ 20.

o Third Quartile (Q3): The third quartile is roughly the value at position (27 + 1) × 0.75 = 21. The 21st value in the ordered list is 35, so Q3 ≈ 35.

• Five-Number Summary: This summary consists of the minimum, Q1, median, Q3, and maximum values.

o Minimum: 13

o Q1: 20

o Median (Q2): The median is the middle value. In this dataset of 27 values, the median is the 14th value, which is 25.

o Q3: 35

o Maximum: 70

• Boxplot: The boxplot follows directly from the five-number summary: the box runs from Q1 = 20 to Q3 = 35 with the median marked at 25. Since the IQR is 15, the value 70 lies beyond Q3 + 1.5 × IQR = 57.5 and would typically be drawn as an outlier (a plotting sketch is given after this list).

• Quantile-Quantile Plot vs. Quantile Plot:

o A quantile plot is a graphical method for displaying all of the data of a single attribute, where each observed value is paired with its quantile information.

o A quantile-quantile plot (Q-Q plot) compares the quantiles of one distribution against the corresponding quantiles of another. It is used to determine whether two datasets come from populations with a common distribution.
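A minimal NumPy/Matplotlib sketch of these calculations, including the boxplot asked for in part (d), could look like the following (an illustration added for convenience, using only the age values listed in the question):

import numpy as np
import matplotlib.pyplot as plt

age = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
              30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

# Midrange: average of the minimum and maximum
midrange = (min(age) + max(age)) / 2          # 41.5

# Rough quartiles via the (n + 1) * p rank rule used in the solution
q1 = age[int((len(age) + 1) * 0.25) - 1]      # 7th value  -> 20
q3 = age[int((len(age) + 1) * 0.75) - 1]      # 21st value -> 35
median = np.median(age)                       # 14th value -> 25

print("Midrange:", midrange)
print("Five-number summary:", min(age), q1, median, q3, max(age))

# d. Boxplot of the age data
plt.boxplot(age)
plt.title("Boxplot of age")
plt.ylabel("Age")
plt.show()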
3. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows:

Compute an approximate median value for the data.

Solution: Approximate Median for Grouped Data

• Total Frequency: The total number of data points is the sum of all frequencies.

o Total frequency = 200 + 450 + 300 + 1500 + 700 + 44 = 3194.

• Median Position: The median is the middle value, which is at position 3194/2=1597.

• Median Group: We need to find which interval the 1597th value falls into by summing the frequencies.

o Interval 1-5: 200

o Interval 6-15: 200+450=650

o Interval 16-20: 650+300=950

o Interval 21-50: 950+1500=2450. The median position (1597) falls within this group.

• Approximate Median Calculation:

o The formula for the approximate median of grouped data is:
Median ≈ L1 + ((N/2 − (Σfreq)l) / freqmedian) × width

o L1 (lower boundary of the median group) = 21

o N (total frequency) = 3194

o (Σfreq)l (sum of the frequencies of the groups below the median group) = 200 + 450 + 300 = 950

o freqmedian (frequency of the median group) = 1500

o width (width of the median group) = 50 − 21 = 29

o Median ≈ 21 + ((3194/2 − 950) / 1500) × 29 = 21 + (647 / 1500) × 29

o Median ≈ 21 + 0.4313 × 29 ≈ 21 + 12.51 ≈ 33.5
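The same interpolation can be scripted as a quick check; this sketch uses only the boundaries and frequencies quoted in the solution above:

# Approximate median for grouped data:
# median ~ L1 + ((N/2 - cum_freq_below) / freq_median) * width
L1 = 21                            # lower boundary of the median group (21-50)
N = 3194                           # total frequency
cum_freq_below = 200 + 450 + 300   # frequencies below the median group = 950
freq_median = 1500                 # frequency of the median group
width = 50 - 21                    # width of the median group = 29

median = L1 + ((N / 2 - cum_freq_below) / freq_median) * width
print(round(median, 2))            # ~ 33.51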

4. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the
following results:

(a) Calculate the mean, median, and standard deviation of age and %fat.

(b) Draw the boxplots for age and %fat.

(c) Draw a scatter plot based on these two variables.
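The table of 18 (age, %fat) measurements is not reproduced in this document, so only a minimal pandas/Matplotlib sketch is given here; the age and fat lists below are hypothetical placeholders and must be replaced with the assignment's actual 18 values:

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder values only -- replace with the 18 (age, %fat) pairs from the table
age = [25, 30, 35, 40, 45, 50, 55]
fat = [10.0, 15.5, 20.0, 25.5, 28.0, 32.5, 35.0]

data = pd.DataFrame({'age': age, '%fat': fat})

# (a) Mean, median, and standard deviation of each attribute
print(data.mean())
print(data.median())
print(data.std())

# (b) Boxplots for age and %fat
data.boxplot(column=['age', '%fat'])
plt.title('Boxplots of age and %fat')
plt.show()

# (c) Scatter plot of age against %fat
data.plot.scatter(x='age', y='%fat', title='Age vs. %fat')
plt.show()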

5. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):

(a) Compute the Euclidean distance between the two objects.

(b) Compute the Manhattan distance between the two objects.

(c) Compute the Minkowski distance between the two objects, using q = 3.

(d) Compute the supremum distance between the two objects.

Solution: (a) The Euclidean distance is the straight-line distance between two points in Euclidean space. It is calculated as the square root of the sum of the squared differences between the corresponding elements of the tuples. For the given tuples X = (22, 1, 42, 10) and Y = (20, 0, 36, 8):

d(X, Y) = √((22 − 20)² + (1 − 0)² + (42 − 36)² + (10 − 8)²)

d(X, Y) = √(2² + 1² + 6² + 2²)

d(X, Y) = √(4 + 1 + 36 + 4) = √45 ≈ 6.708

(b) The Manhattan distance (also known as city block distance or L1 norm) is the sum of the absolute
differences of the Cartesian coordinates of the tuples.

d(X, Y) = |22 − 20| + |1 − 0| + |42 − 36| + |10 − 8|

d(X, Y) = 2 + 1 + 6 + 2 = 11

(c) The Minkowski distance is a generalization of both the Euclidean and Manhattan distances. For q=3, the
formula is the q-th root of the sum of the q-th powers of the absolute differences of the
coordinates.

d(X, Y) = (|22 − 20|³ + |1 − 0|³ + |42 − 36|³ + |10 − 8|³)^(1/3)

d(X, Y) = (2³ + 1³ + 6³ + 2³)^(1/3)

d(X, Y) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.153

(d) The supremum distance (also known as Chebyshev distance or L-infinity norm) is the maximum of the
absolute differences between the corresponding elements of the tuples.

d(X, Y) = max(|22 − 20|, |1 − 0|, |42 − 36|, |10 − 8|)

d(X, Y) = max(2, 1, 6, 2) = 6
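These four results can be cross-checked with SciPy's built-in distance functions; a minimal sketch:

import numpy as np
from scipy.spatial import distance

x = np.array([22, 1, 42, 10])
y = np.array([20, 0, 36, 8])

print(distance.euclidean(x, y))       # ~ 6.708  (Euclidean, L2)
print(distance.cityblock(x, y))       # 11       (Manhattan, L1)
print(distance.minkowski(x, y, p=3))  # ~ 6.153  (Minkowski, q = 3)
print(distance.chebyshev(x, y))       # 6        (supremum, L-infinity)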

6. Consider the following text:

credibility he India-Pakistan conflict (2025)

Main article: 2025 India–Pakistan conflict

The 2025 India–Pakistan conflict was a brief armed conflict between India and Pakistan that began on 7 May
2025, after India launched missile strikes on Pakistan, codenamed Operation Sindoor.
India stated that the operation was a response to the Pahalgam attack on 22 April by
militants in the Indian administered Kashmir killing 26 civilians, mostly tourists. The
attack intensified tensions between India and Pakistan as India accused Pakistan of
supporting cross-border terrorism, which Pakistan denied.

Apply data cleaning and use the Boolean and TF methods for feature representation. Consider unigrams as the features.

Solution:

import re
from collections import defaultdict
import pandas as pd

# Original text
text = """
credibility he India-Pakistan conflict (2025)
Main article: 2025 India–Pakistan conflict
The 2025 India–Pakistan conflict was a brief armed conflict between India and Pakistan that
began on 7 May 2025, after India launched missile strikes on Pakistan, codenamed Operation
Sindoor. India stated that the operation was a response to the Pahalgam attack on 22 April by
militants in the Indian administered Kashmir killing 26 civilians, mostly tourists. The attack
intensified tensions between India and Pakistan as India accused Pakistan of supporting
cross-border terrorism, which Pakistan denied.
"""

# Data cleaning
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

cleaned_text = clean_text(text)
print("Cleaned Text:")
print(cleaned_text)
print("\n")

# Tokenization and unigram extraction
tokens = cleaned_text.split()
unigrams = set(tokens)  # Get unique unigrams
print("Unique Unigrams (Features):")
print(unigrams)
print("\n")

# Boolean feature representation: 1 if the unigram occurs in the document, else 0
def boolean_vector(text, features):
    text_tokens = text.split()
    vector = []
    for feature in sorted(features):
        vector.append(1 if feature in text_tokens else 0)
    return vector

# TF feature representation: term count normalized by the most frequent term
def tf_vector(text, features):
    text_tokens = text.split()
    token_counts = defaultdict(int)
    for token in text_tokens:
        token_counts[token] += 1
    max_count = max(token_counts.values()) if token_counts else 1
    vector = []
    for feature in sorted(features):
        count = token_counts.get(feature, 0)
        # Normalize by the maximum count in the document (alternative: raw counts)
        vector.append(count / max_count)
    return vector

# Create feature representations
features = sorted(unigrams)
boolean_repr = boolean_vector(cleaned_text, features)
tf_repr = tf_vector(cleaned_text, features)

# Create a DataFrame for better visualization
df = pd.DataFrame({
    'Unigram': features,
    'Boolean': boolean_repr,
    'TF': tf_repr
})

print("Feature Representations:")
print(df.to_string(index=False))
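As an optional cross-check (assuming scikit-learn is available), the same Boolean and term-frequency features can be built with CountVectorizer. Note that it returns raw term counts rather than the max-normalized TF used above, and its default tokenizer drops single-character tokens such as "a". The snippet is meant to be appended to the script, since it reuses cleaned_text:

from sklearn.feature_extraction.text import CountVectorizer

bool_vec = CountVectorizer(binary=True)   # Boolean (presence/absence) features
count_vec = CountVectorizer()             # raw term-frequency features

bool_matrix = bool_vec.fit_transform([cleaned_text]).toarray()
count_matrix = count_vec.fit_transform([cleaned_text]).toarray()

print(dict(zip(bool_vec.get_feature_names_out(), bool_matrix[0])))
print(dict(zip(count_vec.get_feature_names_out(), count_matrix[0])))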
