1.
Using the following data, find the dissimilarity matrices using:
a. Nominal attributes
b. Ordinal attributes
c. Numeric attributes
d. All types of attributes (mixed type)
Solution:
import pandas as pd
import numpy as np
from scipy.spatial import distance

# Create the dataframe from the given table
data = {
    'Object Identifier': [1, 2, 3, 4],
    'test-1 (nominal)': ['code A', 'code B', 'code C', 'code A'],
    'test-2 (ordinal)': ['excellent', 'fair', 'good', 'excellent'],
    'test-3 (numeric)': [45, 22, 64, 28]
}
df = pd.DataFrame(data)

# Define the rank order for the ordinal attribute
ordinal_order = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}

# a. Dissimilarity matrix for nominal attributes (test-1):
# d(i, j) = 0 if the values match, 1 otherwise
def nominal_dissimilarity(df, column):
    n = len(df)
    dissimilarity = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dissimilarity[i, j] = 0 if df[column][i] == df[column][j] else 1
    return dissimilarity

nominal_dissim = nominal_dissimilarity(df, 'test-1 (nominal)')
print("a. Dissimilarity matrix for nominal attributes (test-1):")
print(nominal_dissim)
print("\n")

# b. Dissimilarity matrix for ordinal attributes (test-2):
# map ranks onto [0, 1] and take absolute differences
def ordinal_dissimilarity(df, column, order_mapping):
    n = len(df)
    dissimilarity = np.zeros((n, n))
    # Convert ordinal values to their numeric ranks
    numeric_values = df[column].map(order_mapping)
    max_val = max(order_mapping.values())
    min_val = min(order_mapping.values())
    for i in range(n):
        for j in range(n):
            dissimilarity[i, j] = abs(numeric_values[i] - numeric_values[j]) / (max_val - min_val)
    return dissimilarity

ordinal_dissim = ordinal_dissimilarity(df, 'test-2 (ordinal)', ordinal_order)
print("b. Dissimilarity matrix for ordinal attributes (test-2):")
print(ordinal_dissim)
print("\n")

# c. Dissimilarity matrix for numeric attributes (test-3):
# Euclidean distance, normalized by the maximum pairwise distance
def numeric_dissimilarity(df, column):
    values = df[column].values.reshape(-1, 1)
    dissimilarity = distance.squareform(distance.pdist(values, 'euclidean'))
    # Normalize by the maximum distance
    max_dist = np.max(dissimilarity)
    if max_dist > 0:
        dissimilarity = dissimilarity / max_dist
    return dissimilarity

numeric_dissim = numeric_dissimilarity(df, 'test-3 (numeric)')
print("c. Dissimilarity matrix for numeric attributes (test-3):")
print(numeric_dissim)
print("\n")

# d. Dissimilarity matrix for mixed attributes:
# weighted average of the per-type matrices
def mixed_dissimilarity(df, nominal_cols, ordinal_cols, numeric_cols, ordinal_order):
    n = len(df)
    total_dissim = np.zeros((n, n))
    weights = {'nominal': 1/3, 'ordinal': 1/3, 'numeric': 1/3}  # Equal weights
    # Nominal contribution
    for col in nominal_cols:
        total_dissim += weights['nominal'] * nominal_dissimilarity(df, col)
    # Ordinal contribution
    for col in ordinal_cols:
        total_dissim += weights['ordinal'] * ordinal_dissimilarity(df, col, ordinal_order)
    # Numeric contribution
    for col in numeric_cols:
        total_dissim += weights['numeric'] * numeric_dissimilarity(df, col)
    return total_dissim

mixed_dissim = mixed_dissimilarity(
    df,
    nominal_cols=['test-1 (nominal)'],
    ordinal_cols=['test-2 (ordinal)'],
    numeric_cols=['test-3 (numeric)'],
    ordinal_order=ordinal_order
)
print("d. Dissimilarity matrix for mixed attributes:")
print(mixed_dissim)
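As a sanity check on part (a), the nominal matrix can be read off by hand: objects 1 and 4 both carry 'code A', so their dissimilarity is 0, and every other pair of distinct objects differs, so their entries are 1. A dependency-free sketch:

```python
# Nominal dissimilarity without pandas/scipy: d(i, j) = 0 iff the values match
values = ['code A', 'code B', 'code C', 'code A']
matrix = [[0 if a == b else 1 for b in values] for a in values]
for row in matrix:
    print(row)
```

The printed rows are [0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0], matching the symmetric matrix produced by the full solution.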
2. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are
(in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70.
a. What is the midrange of the data?
b. Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
c. Give the five-number summary of the data.
d. Show a boxplot of the data.
e. How is a quantile–quantile plot different from a quantile plot?
Solution: Analysis of Age Data
• Midrange: The midrange is the average of the minimum and maximum values in the dataset.
o Minimum value: 13
o Maximum value: 70
o Midrange = (13 + 70) / 2 = 41.5
• Quartiles (Q1 and Q3):
o There are 27 data points in total.
o First Quartile (Q1): The first quartile is the value at position (27 + 1) × 0.25 = 7. The 7th value in
the ordered list is 20.
o Third Quartile (Q3): The third quartile is the value at position (27 + 1) × 0.75 = 21. The 21st value
in the ordered list is 35.
• Five-Number Summary: This summary consists of the minimum, Q1, median, Q3, and maximum values.
o Minimum: 13
o Q1: 20
o Median (Q2): The median is the middle value. In this dataset of 27 values, the median is the
14th value, which is 25.
o Q3: 35
o Maximum: 70
• Boxplot: The box spans Q1 = 20 to Q3 = 35 with the median line at 25. Since IQR = 15, the upper fence is Q3 + 1.5 × IQR = 57.5, so the upper whisker ends at 52 and the value 70 is plotted as an outlier; the lower whisker extends to the minimum, 13.
• Quantile-Quantile Plot vs. Quantile Plot:
o A quantile plot is a graphical method for displaying all of the data, where each value is paired with its quantile.
o A quantile-quantile plot (Q-Q plot) compares the quantiles of a probability distribution to the
quantiles of a different distribution. It's used to determine if two datasets come from
populations with a common distribution.
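The values above can be verified with a short NumPy computation. Note that the quartile positions use the (n + 1) × p rule from the solution; other conventions (e.g. NumPy's default percentile interpolation) can give slightly different quartiles.

```python
import numpy as np

# Age data from the problem (27 values, already in increasing order)
ages = np.array([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
                 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

n = len(ages)                                        # 27
midrange = (int(ages.min()) + int(ages.max())) / 2   # (13 + 70) / 2 = 41.5
q1 = int(ages[(n + 1) // 4 - 1])                     # 7th ordered value = 20
median = float(np.median(ages))                      # 14th value = 25
q3 = int(ages[3 * (n + 1) // 4 - 1])                 # 21st ordered value = 35
print(midrange, q1, median, q3)
```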
3. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows:
Interval 1–5: frequency 200; 6–15: 450; 16–20: 300; 21–50: 1500; 51–80: 700; 81–110: 44.
Compute an approximate median value for the data.
Solution: Approximate Median for Grouped Data
• Total Frequency: The total number of data points is the sum of all frequencies.
o Total frequency = 200 + 450 + 300 + 1500 + 700 + 44 = 3194.
• Median Position: The median is the middle value, which is at position 3194/2=1597.
• Median Group: We need to find which interval the 1597th value falls into by summing the frequencies.
o Interval 1-5: 200
o Interval 6-15: 200+450=650
o Interval 16-20: 650+300=950
o Interval 21-50: 950+1500=2450. The median position (1597) falls within this group.
• Approximate Median Calculation:
o The formula for the approximate median of grouped data is:
Median ≈ L1 + ((N/2 − (∑freq)_l) / freq_median) × width
o L1 (lower boundary of median group) = 21
o N (total frequency) = 3194
o (∑freq)_l (sum of frequencies of groups below the median group) = 200 + 450 + 300 = 950
o freq_median (frequency of median group) = 1500
o width (width of median group) = 50 − 21 = 29
o Median ≈ 21 + ((3194/2 − 950) / 1500) × 29 = 21 + ((1597 − 950) / 1500) × 29
o Median ≈ 21 + (647 / 1500) × 29 ≈ 21 + 0.4313 × 29 ≈ 21 + 12.51 ≈ 33.5.
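The grouped-data median formula worked through above can be checked with a few lines of Python:

```python
# Frequencies from the solution, in interval order
freqs = [200, 450, 300, 1500, 700, 44]
N = sum(freqs)                       # 3194
cum_below = sum(freqs[:3])           # 950: total frequency below the 21-50 group
L1, width, f_median = 21, 29, 1500   # lower bound, width, frequency of median group

median = L1 + (N / 2 - cum_below) / f_median * width
print(round(median, 1))              # ≈ 33.5
```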
4. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the
following results:
(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot based on these two variables.
5. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
Solution: (a) The Euclidean distance is the straight-line distance between two points in Euclidean space. It is
calculated as the square root of the sum of the squared differences between the
corresponding elements of the tuples. For the given tuples X = (22, 1, 42, 10) and
Y = (20, 0, 36, 8):
d(X,Y) = √((22−20)² + (1−0)² + (42−36)² + (10−8)²)
d(X,Y) = √(2² + 1² + 6² + 2²)
d(X,Y) = √(4 + 1 + 36 + 4) = √45 ≈ 6.708
(b) The Manhattan distance (also known as city block distance or L1 norm) is the sum of the absolute
differences of the Cartesian coordinates of the tuples.
d(X,Y) = |22−20| + |1−0| + |42−36| + |10−8|
d(X,Y) = 2 + 1 + 6 + 2 = 11
(c) The Minkowski distance is a generalization of both the Euclidean and Manhattan distances. For q=3, the
formula is the q-th root of the sum of the q-th powers of the absolute differences of the
coordinates.
d(X,Y) = (|22−20|³ + |1−0|³ + |42−36|³ + |10−8|³)^(1/3)
d(X,Y) = (2³ + 1³ + 6³ + 2³)^(1/3)
d(X,Y) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.153
(d) The supremum distance (also known as Chebyshev distance or L-infinity norm) is the maximum of the
absolute differences between the corresponding elements of the tuples.
d(X,Y) = max(|22−20|, |1−0|, |42−36|, |10−8|)
d(X,Y) = max(2, 1, 6, 2) = 6
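All four distances can be verified with a few NumPy one-liners, since each is built from the same vector of absolute differences:

```python
import numpy as np

x = np.array([22, 1, 42, 10])
y = np.array([20, 0, 36, 8])
diff = np.abs(x - y)                              # [2, 1, 6, 2]

euclidean = float(np.sqrt(np.sum(diff ** 2)))     # sqrt(45) ≈ 6.708
manhattan = int(np.sum(diff))                     # 11
minkowski3 = float(np.sum(diff ** 3) ** (1 / 3))  # 233 ** (1/3) ≈ 6.153
supremum = int(np.max(diff))                      # 6
print(euclidean, manhattan, minkowski3, supremum)
```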
6. Consider the following text:
credibility he India-Pakistan conflict (2025)
Main article: 2025 India–Pakistan conflict
The 2025 India–Pakistan conflict was a brief armed conflict between India and Pakistan that began on 7 May
2025, after India launched missile strikes on Pakistan, codenamed Operation Sindoor.
India stated that the operation was a response to the Pahalgam attack on 22 April by
militants in the Indian administered Kashmir killing 26 civilians, mostly tourists. The
attack intensified tensions between India and Pakistan as India accused Pakistan of
supporting cross-border terrorism, which Pakistan denied.
Apply data cleaning and use the Boolean and TF methods for feature representation. Consider unigrams as
the features.
Solution:
import re
from collections import defaultdict
import pandas as pd

# Original text
text = """
credibility he India-Pakistan conflict (2025)
Main article: 2025 India–Pakistan conflict
The 2025 India–Pakistan conflict was a brief armed conflict between India and Pakistan that began on 7 May
2025, after India launched missile strikes on Pakistan, codenamed Operation Sindoor. India
stated that the operation was a response to the Pahalgam attack on 22 April by militants
in the Indian administered Kashmir killing 26 civilians, mostly tourists. The attack
intensified tensions between India and Pakistan as India accused Pakistan of supporting
cross-border terrorism, which Pakistan denied.
"""

# Data cleaning
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

cleaned_text = clean_text(text)
print("Cleaned Text:")
print(cleaned_text)
print("\n")

# Tokenization and unigram extraction
tokens = cleaned_text.split()
unigrams = set(tokens)  # Get unique unigrams
print("Unique Unigrams (Features):")
print(sorted(unigrams))
print("\n")

# Boolean feature representation: 1 if the unigram occurs, 0 otherwise
def boolean_vector(text, features):
    text_tokens = text.split()
    vector = []
    for feature in sorted(features):
        vector.append(1 if feature in text_tokens else 0)
    return vector

# TF feature representation
def tf_vector(text, features):
    text_tokens = text.split()
    token_counts = defaultdict(int)
    for token in text_tokens:
        token_counts[token] += 1
    max_count = max(token_counts.values()) if token_counts else 1
    vector = []
    for feature in sorted(features):
        count = token_counts.get(feature, 0)
        # Normalize by the max count in the document (alternative: raw counts)
        vector.append(count / max_count)
    return vector

# Create feature representations
features = sorted(unigrams)
boolean_repr = boolean_vector(cleaned_text, features)
tf_repr = tf_vector(cleaned_text, features)

# Create a DataFrame for better visualization
df = pd.DataFrame({
    'Unigram': features,
    'Boolean': boolean_repr,
    'TF': tf_repr
})
print("Feature Representations:")
print(df.to_string(index=False))
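On a toy document the two representations differ as expected: Boolean only records presence, while TF (here normalized by the document's maximum term count, as in `tf_vector` above) records relative frequency. A minimal, dependency-free illustration:

```python
from collections import Counter

# Toy document: 'india' occurs twice, the other terms once
doc = "india pakistan conflict india"
tokens = doc.split()
vocab = sorted(set(tokens))      # ['conflict', 'india', 'pakistan']
counts = Counter(tokens)

boolean = [1 if w in counts else 0 for w in vocab]       # [1, 1, 1]
tf = [counts[w] / max(counts.values()) for w in vocab]   # [0.5, 1.0, 0.5]
print(boolean, tf)
```

Because the vocabulary is built from the document itself, every Boolean entry is 1; the TF vector still distinguishes the repeated term.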