Data Mining Assignment

The document outlines methods for calculating dissimilarity matrices using nominal, ordinal, numeric, and mixed attributes from a dataset. It also discusses statistical analysis of age data, including midrange, quartiles, and boxplots, as well as distance calculations between data tuples. Additionally, it covers data cleaning and feature representation techniques using Boolean and TF methods for text data.
1. Using the following data (recreated in the script below), find the dissimilarity matrices using:

a. Nominal attributes

b. Ordinal attributes

c. Numeric attributes

d. All attribute types (mixed type)

Solution:

import pandas as pd
import numpy as np
from scipy.spatial import distance

# Create the dataframe from the given table
data = {
    'Object Identifier': [1, 2, 3, 4],
    'test-1 (nominal)': ['code A', 'code B', 'code C', 'code A'],
    'test-2 (ordinal)': ['excellent', 'fair', 'good', 'excellent'],
    'test-3 (numeric)': [45, 22, 64, 28]
}
df = pd.DataFrame(data)

# Define the rank order for the ordinal attribute
ordinal_order = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}

# a. Dissimilarity matrix for nominal attributes (test-1):
#    d(i, j) = 0 if the two values match, 1 otherwise
def nominal_dissimilarity(df, column):
    n = len(df)
    dissimilarity = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dissimilarity[i, j] = 0 if df[column][i] == df[column][j] else 1
    return dissimilarity

nominal_dissim = nominal_dissimilarity(df, 'test-1 (nominal)')
print("a. Dissimilarity matrix for nominal attributes (test-1):")
print(nominal_dissim)
print("\n")

# b. Dissimilarity matrix for ordinal attributes (test-2):
#    map each value to its rank, then use the normalized rank difference
def ordinal_dissimilarity(df, column, order_mapping):
    n = len(df)
    dissimilarity = np.zeros((n, n))
    # Convert ordinal values to numeric ranks
    numeric_values = df[column].map(order_mapping)
    max_val = max(order_mapping.values())
    min_val = min(order_mapping.values())
    for i in range(n):
        for j in range(n):
            dissimilarity[i, j] = abs(numeric_values[i] - numeric_values[j]) / (max_val - min_val)
    return dissimilarity

ordinal_dissim = ordinal_dissimilarity(df, 'test-2 (ordinal)', ordinal_order)
print("b. Dissimilarity matrix for ordinal attributes (test-2):")
print(ordinal_dissim)
print("\n")

# c. Dissimilarity matrix for numeric attributes (test-3):
#    pairwise Euclidean distance normalized by the maximum distance
def numeric_dissimilarity(df, column):
    values = df[column].values.reshape(-1, 1)
    dissimilarity = distance.squareform(distance.pdist(values, 'euclidean'))
    # Normalize by the maximum pairwise distance
    max_dist = np.max(dissimilarity)
    if max_dist > 0:
        dissimilarity = dissimilarity / max_dist
    return dissimilarity

numeric_dissim = numeric_dissimilarity(df, 'test-3 (numeric)')
print("c. Dissimilarity matrix for numeric attributes (test-3):")
print(numeric_dissim)
print("\n")

# d. Dissimilarity matrix for mixed attribute types:
#    weighted sum of the per-type matrices (equal weights here)
def mixed_dissimilarity(df, nominal_cols, ordinal_cols, numeric_cols, ordinal_order):
    n = len(df)
    total_dissim = np.zeros((n, n))
    weights = {'nominal': 1/3, 'ordinal': 1/3, 'numeric': 1/3}  # Equal weights

    # Nominal contribution
    for col in nominal_cols:
        dissim = nominal_dissimilarity(df, col)
        total_dissim += weights['nominal'] * dissim

    # Ordinal contribution
    for col in ordinal_cols:
        dissim = ordinal_dissimilarity(df, col, ordinal_order)
        total_dissim += weights['ordinal'] * dissim

    # Numeric contribution
    for col in numeric_cols:
        dissim = numeric_dissimilarity(df, col)
        total_dissim += weights['numeric'] * dissim

    return total_dissim

mixed_dissim = mixed_dissimilarity(
    df,
    nominal_cols=['test-1 (nominal)'],
    ordinal_cols=['test-2 (ordinal)'],
    numeric_cols=['test-3 (numeric)'],
    ordinal_order=ordinal_order
)
print("d. Dissimilarity matrix for mixed attributes:")
print(mixed_dissim)
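For the four objects embedded in the data dictionary, the script should print approximately the following lower-triangle values (each matrix is symmetric with a zero diagonal):

a. Nominal: d(2,1) = 1, d(3,1) = 1, d(3,2) = 1, d(4,1) = 0, d(4,2) = 1, d(4,3) = 1

b. Ordinal: d(2,1) ≈ 0.667, d(3,1) ≈ 0.333, d(3,2) ≈ 0.333, d(4,1) = 0, d(4,2) ≈ 0.667, d(4,3) ≈ 0.333

c. Numeric (normalized by the largest gap, 42): d(2,1) ≈ 0.548, d(3,1) ≈ 0.452, d(3,2) = 1, d(4,1) ≈ 0.405, d(4,2) ≈ 0.143, d(4,3) ≈ 0.857

d. Mixed (equal-weight average of a-c): d(2,1) ≈ 0.738, d(3,1) ≈ 0.595, d(3,2) ≈ 0.778, d(4,1) ≈ 0.135, d(4,2) ≈ 0.603, d(4,3) ≈ 0.730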

2. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are
(in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70.

a. What is the midrange of the data?

b. Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

c. Give the five-number summary of the data.

d. Show a boxplot of the data.

e. How is a quantile–quantile plot different from a quantile plot?


Solution: Analysis of Age Data

• Midrange: The midrange is the average of the minimum and maximum values in the dataset.

o Minimum value: 13

o Maximum value: 70

o Midrange = (13 + 70) / 2 = 41.5

• Quartiles (Q1 and Q3):

o There are 27 data points in total.

o First Quartile (Q1): The first quartile is roughly the value at position (27 + 1) × 0.25 = 7. The 7th value in the ordered list is 20, so Q1 ≈ 20.

o Third Quartile (Q3): The third quartile is roughly the value at position (27 + 1) × 0.75 = 21. The 21st value in the ordered list is 35, so Q3 ≈ 35.

• Five-Number Summary: This summary consists of the minimum, Q1, median, Q3, and maximum values.

o Minimum: 13

o Q1: 20

o Median (Q2): The median is the middle value. In this dataset of 27 values, the median is the 14th value, which is 25.

o Q3: 35

o Maximum: 70

• Boxplot: The boxplot follows directly from the five-number summary: the box runs from Q1 = 20 to Q3 = 35 with the median marked at 25. Since the IQR is 15, the value 70 lies beyond Q3 + 1.5 × IQR = 57.5 and would typically be drawn as an outlier (a plotting sketch is given after this list).

• Quantile-Quantile Plot vs. Quantile Plot:

o A quantile plot is a graphical method for displaying all of the data of a single attribute, where each observed value is paired with its quantile information.

o A quantile-quantile plot (Q-Q plot) compares the quantiles of one distribution against the corresponding quantiles of another. It is used to determine whether two datasets come from populations with a common distribution.
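A minimal NumPy/Matplotlib sketch of these calculations, including the boxplot asked for in part (d), could look like the following (an illustration added for convenience, using only the age values listed in the question):

import numpy as np
import matplotlib.pyplot as plt

age = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
              30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

# Midrange: average of the minimum and maximum
midrange = (min(age) + max(age)) / 2          # 41.5

# Rough quartiles via the (n + 1) * p rank rule used in the solution
q1 = age[int((len(age) + 1) * 0.25) - 1]      # 7th value  -> 20
q3 = age[int((len(age) + 1) * 0.75) - 1]      # 21st value -> 35
median = np.median(age)                       # 14th value -> 25

print("Midrange:", midrange)
print("Five-number summary:", min(age), q1, median, q3, max(age))

# d. Boxplot of the age data
plt.boxplot(age)
plt.title("Boxplot of age")
plt.ylabel("Age")
plt.show()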
3. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows:

Compute an approximate median value for the data.

Solution: Approximate Median for Grouped Data

• Total Frequency: The total number of data points is the sum of all frequencies.

o Total frequency = 200 + 450 + 300 + 1500 + 700 + 44 = 3194.

• Median Position: The median is the middle value, which is at position 3194/2=1597.

• Median Group: We need to find which interval the 1597th value falls into by summing the frequencies.

o Interval 1-5: 200

o Interval 6-15: 200+450=650

o Interval 16-20: 650+300=950

o Interval 21-50: 950+1500=2450. The median position (1597) falls within this group.

• Approximate Median Calculation:

o The formula for the approximate median of grouped data is:
Median ≈ L1 + ((N/2 − (Σfreq)l) / freqmedian) × width

o L1 (lower boundary of the median group) = 21

o N (total frequency) = 3194

o (Σfreq)l (sum of the frequencies of the groups below the median group) = 200 + 450 + 300 = 950

o freqmedian (frequency of the median group) = 1500

o width (width of the median group) = 50 − 21 = 29

o Median ≈ 21 + ((3194/2 − 950) / 1500) × 29 = 21 + (647 / 1500) × 29

o Median ≈ 21 + 0.4313 × 29 ≈ 21 + 12.51 ≈ 33.5
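The same interpolation can be scripted as a quick check; this sketch uses only the boundaries and frequencies quoted in the solution above:

# Approximate median for grouped data:
# median ~ L1 + ((N/2 - cum_freq_below) / freq_median) * width
L1 = 21                            # lower boundary of the median group (21-50)
N = 3194                           # total frequency
cum_freq_below = 200 + 450 + 300   # frequencies below the median group = 950
freq_median = 1500                 # frequency of the median group
width = 50 - 21                    # width of the median group = 29

median = L1 + ((N / 2 - cum_freq_below) / freq_median) * width
print(round(median, 2))            # ~ 33.51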

4. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the
following results:

(a) Calculate the mean, median, and standard deviation of age and %fat.

(b) Draw the boxplots for age and %fat.

(c) Draw a scatter plot based on these two variables.
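The table of 18 (age, %fat) measurements is not reproduced in this document, so only a minimal pandas/Matplotlib sketch is given here; the age and fat lists below are hypothetical placeholders and must be replaced with the assignment's actual 18 values:

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder values only -- replace with the 18 (age, %fat) pairs from the table
age = [25, 30, 35, 40, 45, 50, 55]
fat = [10.0, 15.5, 20.0, 25.5, 28.0, 32.5, 35.0]

data = pd.DataFrame({'age': age, '%fat': fat})

# (a) Mean, median, and standard deviation of each attribute
print(data.mean())
print(data.median())
print(data.std())

# (b) Boxplots for age and %fat
data.boxplot(column=['age', '%fat'])
plt.title('Boxplots of age and %fat')
plt.show()

# (c) Scatter plot of age against %fat
data.plot.scatter(x='age', y='%fat', title='Age vs. %fat')
plt.show()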

5. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):

(a) Compute the Euclidean distance between the two objects.

(b) Compute the Manhattan distance between the two objects.

(c) Compute the Minkowski distance between the two objects, using q = 3.

(d) Compute the supremum distance between the two objects.

Solution: (a) The Euclidean distance is the straight-line distance between two points in Euclidean space. It is calculated as the square root of the sum of the squared differences between the corresponding elements of the tuples. For the given tuples X = (22, 1, 42, 10) and Y = (20, 0, 36, 8):

d(X, Y) = √((22 − 20)² + (1 − 0)² + (42 − 36)² + (10 − 8)²)

d(X, Y) = √(2² + 1² + 6² + 2²)

d(X, Y) = √(4 + 1 + 36 + 4) = √45 ≈ 6.708

(b) The Manhattan distance (also known as city block distance or L1 norm) is the sum of the absolute
differences of the Cartesian coordinates of the tuples.

d(X, Y) = |22 − 20| + |1 − 0| + |42 − 36| + |10 − 8|

d(X, Y) = 2 + 1 + 6 + 2 = 11

(c) The Minkowski distance is a generalization of both the Euclidean and Manhattan distances. For q=3, the
formula is the q-th root of the sum of the q-th powers of the absolute differences of the
coordinates.

d(X, Y) = (|22 − 20|³ + |1 − 0|³ + |42 − 36|³ + |10 − 8|³)^(1/3)

d(X, Y) = (2³ + 1³ + 6³ + 2³)^(1/3)

d(X, Y) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.153

(d) The supremum distance (also known as Chebyshev distance or L-infinity norm) is the maximum of the
absolute differences between the corresponding elements of the tuples.

d(X, Y) = max(|22 − 20|, |1 − 0|, |42 − 36|, |10 − 8|)

d(X, Y) = max(2, 1, 6, 2) = 6
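These four results can be cross-checked with SciPy's built-in distance functions; a minimal sketch:

import numpy as np
from scipy.spatial import distance

x = np.array([22, 1, 42, 10])
y = np.array([20, 0, 36, 8])

print(distance.euclidean(x, y))       # ~ 6.708  (Euclidean, L2)
print(distance.cityblock(x, y))       # 11       (Manhattan, L1)
print(distance.minkowski(x, y, p=3))  # ~ 6.153  (Minkowski, q = 3)
print(distance.chebyshev(x, y))       # 6        (supremum, L-infinity)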

6. Consider the following text:

credibility he India-Pakistan conflict (2025)

Main article: 2025 India–Pakistan conflict

The 2025 India–Pakistan conflict was a brief armed conflict between India and Pakistan that began on 7 May
2025, after India launched missile strikes on Pakistan, codenamed Operation Sindoor.
India stated that the operation was a response to the Pahalgam attack on 22 April by
militants in the Indian administered Kashmir killing 26 civilians, mostly tourists. The
attack intensified tensions between India and Pakistan as India accused Pakistan of
supporting cross-border terrorism, which Pakistan denied.

Apply data cleaning and use the Boolean and TF methods for feature representation. Consider unigrams as the features.

Solution:

import re
from collections import defaultdict
import pandas as pd

# Original text
text = """
credibility he India-Pakistan conflict (2025)
Main article: 2025 India–Pakistan conflict
The 2025 India–Pakistan conflict was a brief armed conflict between India and Pakistan that
began on 7 May 2025, after India launched missile strikes on Pakistan, codenamed Operation
Sindoor. India stated that the operation was a response to the Pahalgam attack on 22 April by
militants in the Indian administered Kashmir killing 26 civilians, mostly tourists. The attack
intensified tensions between India and Pakistan as India accused Pakistan of supporting
cross-border terrorism, which Pakistan denied.
"""

# Data cleaning
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

cleaned_text = clean_text(text)
print("Cleaned Text:")
print(cleaned_text)
print("\n")

# Tokenization and unigram extraction
tokens = cleaned_text.split()
unigrams = set(tokens)  # Get unique unigrams
print("Unique Unigrams (Features):")
print(unigrams)
print("\n")

# Boolean feature representation: 1 if the unigram occurs in the document, else 0
def boolean_vector(text, features):
    text_tokens = text.split()
    vector = []
    for feature in sorted(features):
        vector.append(1 if feature in text_tokens else 0)
    return vector

# TF feature representation: term count normalized by the most frequent term
def tf_vector(text, features):
    text_tokens = text.split()
    token_counts = defaultdict(int)
    for token in text_tokens:
        token_counts[token] += 1
    max_count = max(token_counts.values()) if token_counts else 1
    vector = []
    for feature in sorted(features):
        count = token_counts.get(feature, 0)
        # Normalize by the maximum count in the document (alternative: raw counts)
        vector.append(count / max_count)
    return vector

# Create feature representations
features = sorted(unigrams)
boolean_repr = boolean_vector(cleaned_text, features)
tf_repr = tf_vector(cleaned_text, features)

# Create a DataFrame for better visualization
df = pd.DataFrame({
    'Unigram': features,
    'Boolean': boolean_repr,
    'TF': tf_repr
})

print("Feature Representations:")
print(df.to_string(index=False))
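As an optional cross-check (assuming scikit-learn is available), the same Boolean and term-frequency features can be built with CountVectorizer. Note that it returns raw term counts rather than the max-normalized TF used above, and its default tokenizer drops single-character tokens such as "a". The snippet is meant to be appended to the script, since it reuses cleaned_text:

from sklearn.feature_extraction.text import CountVectorizer

bool_vec = CountVectorizer(binary=True)   # Boolean (presence/absence) features
count_vec = CountVectorizer()             # raw term-frequency features

bool_matrix = bool_vec.fit_transform([cleaned_text]).toarray()
count_matrix = count_vec.fit_transform([cleaned_text]).toarray()

print(dict(zip(bool_vec.get_feature_names_out(), bool_matrix[0])))
print(dict(zip(count_vec.get_feature_names_out(), count_matrix[0])))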
