Name: Muhammad Sarfraz
Seat: EP1850086
Section: A
Course Code: 514
Course Name: Data Warehousing and Data Mining
LAB 01 : CONDITIONS
Write an if-else statement in Python that checks whether a student is enrolled in 2 or 3 subjects with extra
certifications.
Print a proper message for each case.
One subject fee = 1000
Certification fee = 700
2 subjects and 3 certifications are allowed together for a student.
If a student selects 3 subjects, then only two certifications can be selected for enrollment.
In [6]:
subjectFee = 1000
certificationFee = 700
noOfSubjects = 3
noOfCertifications = 2
if noOfSubjects == 2 and noOfCertifications == 3:
    print('Student is enrolled in 2 subjects and 3 certifications')
elif noOfSubjects == 3 and noOfCertifications == 2:
    print('Student is enrolled in 3 subjects and 2 certifications')
else:
    print('Student cannot be enrolled')
Student is enrolled in 3 subjects and 2 certifications
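The fee variables defined above are not used by the conditional itself; as a small optional sketch (not required by the task), the total charge for a valid enrollment could be reported as well:
In [ ]:
# Total charge for the selected combination (3 * 1000 + 2 * 700 = 4400 here)
totalFee = noOfSubjects * subjectFee + noOfCertifications * certificationFee
print('Total fee :', totalFee)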
LAB 02 : LOOPS
Initialize a list of 10 passwords. If a password number exceeds 500, break the loop and print
"Password cannot be greater than 500"; otherwise print every password on a new line with the
message "Your new password".
In [2]:
passwords = [121,55,86,1,147,635,98,63,453,100]
for p in passwords:
    if p > 500:
        print('Password cannot be greater than 500')
        break
    else:
        print('Your new password : ', p)
Your new password : 121
Your new password : 55
Your new password : 86
Your new password : 1
Your new password : 147
Password cannot be greater than 500
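Because the loop stops at the first password above 500, Python's for/else could also be used to report when every password was accepted; this is only an alternative sketch over the same passwords list:
In [ ]:
for p in passwords:
    if p > 500:
        print('Password cannot be greater than 500')
        break
    print('Your new password : ', p)
else:
    # runs only when the loop finishes without hitting break
    print('All passwords accepted')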
LAB 03 : NumPy & Pandas
In [1]:
import numpy as np
import pandas as pd
In [12]:
df = pd.DataFrame(np.random.randn(4,3),index=['a','b','c','d'], columns=
['one','two', 'three'])
In [13]:
df
Out[13]:
one two three
a 1.968427 0.360732 0.526789
b 0.545311 -0.511318 1.771034
c -1.270482 1.454086 -0.179600
d -1.487337 -0.008176 -0.849439
In [14]:
df['one']
Out[14]:
a 1.968427
b 0.545311
c   -1.270482
d   -1.487337
Name: one, dtype: float64
In [15]:
df.loc['a']
Out[15]:
one 1.968427
two 0.360732
three 0.526789
Name: a, dtype: float64
In [16]:
df = df.reindex(['a','b','c','d','e'])
In [17]:
df
Out[17]:
one two three
a 1.968427 0.360732 0.526789
b 0.545311 -0.511318 1.771034
c -1.270482 1.454086 -0.179600
d -1.487337 -0.008176 -0.849439
e NaN NaN NaN
In [18]:
df.fillna('0')
Out[18]:
one two three
a 1.96843 0.360732 0.526789
b 0.545311 -0.511318 1.77103
c -1.27048 1.45409 -0.1796
d -1.48734 -0.00817578 -0.849439
e 0 0 0
In [19]:
df
Out[19]:
one two three
a 1.968427 0.360732 0.526789
b 0.545311 -0.511318 1.771034
c -1.270482 1.454086 -0.179600
d -1.487337 -0.008176 -0.849439
e NaN NaN NaN
In [20]:
df = df.fillna('0')
In [21]:
df
Out[21]:
one two three
a 1.96843 0.360732 0.526789
b 0.545311 -0.511318 1.77103
c -1.27048 1.45409 -0.1796
d -1.48734 -0.00817578 -0.849439
e 0 0 0
In [22]:
df = df.reindex(columns=['one','two','three','four','fiver'])
In [23]:
df
Out[23]:
one two three four fiver
a 1.96843 0.360732 0.526789 NaN NaN
b 0.545311 -0.511318 1.77103 NaN NaN
c -1.27048 1.45409 -0.1796 NaN NaN
d -1.48734 -0.00817578 -0.849439 NaN NaN
e 0 0 0 NaN NaN
In [24]:
df= df.fillna(1)
In [25]:
df
Out[25]:
one two three four fiver
a 1.96843 0.360732 0.526789 1.0 1.0
b 0.545311 -0.511318 1.77103 1.0 1.0
c -1.27048 1.45409 -0.1796 1.0 1.0
d -1.48734 -0.00817578 -0.849439 1.0 1.0
e 0 0 0 1.0 1.0
In [29]:
df =df.rename(columns={'fiver':'five'})
In [30]:
df
Out[30]:
one two three four five
a 1.96843 0.360732 0.526789 1.0 1.0
b 0.545311 -0.511318 1.77103 1.0 1.0
c -1.27048 1.45409 -0.1796 1.0 1.0
d -1.48734 -0.00817578 -0.849439 1.0 1.0
e 0 0 0 1.0 1.0
In [ ]:
# -------------- CREATING NEW DATAFRAME ------------------ #
In [41]:
data_frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6],"C":[7,8,9]})
In [42]:
data_frame
Out[42]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [43]:
data_frame = data_frame.reindex(columns=['A','B','C','D','E'])
In [44]:
data_frame
Out[44]:
A B C D E
0 1 4 7 NaN NaN
1 2 5 8 NaN NaN
2 3 6 9 NaN NaN
In [46]:
for i in data_frame:
    print(data_frame[i])
0    1
1    2
2    3
Name: A, dtype: int64
0    4
1    5
2    6
Name: B, dtype: int64
0    7
1    8
2    9
Name: C, dtype: int64
0   NaN
1   NaN
2   NaN
Name: D, dtype: float64
0   NaN
1   NaN
2   NaN
Name: E, dtype: float64
In [47]:
for i in data_frame:
    print(data_frame[i].isnull())
0    False
1    False
2    False
Name: A, dtype: bool
0    False
1    False
2    False
Name: B, dtype: bool
0    False
1    False
2    False
Name: C, dtype: bool
0     True
1     True
2     True
Name: D, dtype: bool
0     True
1     True
2     True
Name: E, dtype: bool
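Instead of printing the boolean mask column by column, the missing values can be counted per column in one call; a short sketch using the data_frame created above:
In [ ]:
# Number of missing (NaN) values in each column
print(data_frame.isnull().sum())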
In [ ]:
LAB 04 : Gradient Descent for Linear Regression
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [3]:
x_quad = [n/10 for n in range(0, 100)]
y_quad = [(n-4)**2+5 for n in x_quad]
plt.figure(figsize = (10,7))
plt.plot(x_quad, y_quad, 'k--')
plt.axis([0,10,0,30])
plt.plot([1, 2, 3], [14, 9, 6], 'ro')
plt.plot([5, 7, 8],[6, 14, 21], 'bo')
plt.plot(4, 5, 'ko')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Quadratic Equation')
Out[3]:
Text(0.5, 1.0, 'Quadratic Equation')
In [4]:
data = pd.read_csv('../ex1data1.txt', names = ['population', 'profit'])
In [5]:
data
Out[5]:
population profit
0 6.1101 17.59200
1 5.5277 9.13020
2 8.5186 13.66200
3 7.0032 11.85400
4 5.8598 6.82330
... ... ...
92 5.8707 7.20290
93 5.3054 1.98690
94 8.2934 0.14454
95 13.3940 9.05510
96 5.4369 0.61705
97 rows × 2 columns
In [6]:
X_df = pd.DataFrame(data.population)
y_df = pd.DataFrame(data.profit)
m = len(y_df)
In [7]:
X_df
Out[7]:
population
0 6.1101
1 5.5277
2 8.5186
3 7.0032
4 5.8598
... ...
92 5.8707
93 5.3054
94 8.2934
95 13.3940
96 5.4369
97 rows × 1 columns
In [8]:
plt.figure(figsize=(10,8))
plt.plot(X_df, y_df, 'kx')
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
Out[8]:
Text(0, 0.5, 'Profit in $10,000s')
In [9]:
iter = 1000
alpha = 0.01
In [10]:
X_df['intercept'] = 1
In [11]:
X = np.array(X_df)
y = np.array(y_df).flatten()
theta = np.array([0, 0])
In [12]:
def cost_function(X, y, theta):
    m = len(y)
    # Calculate the cost J with the given parameters
    J = np.sum((X.dot(theta) - y)**2) / 2 / m
    return J
In [13]:
cost_function(X, y, theta)
Out[13]:
32.072733877455676
In [14]:
def gradient_descent(X, y, theta, alpha, iterations):
    cost_history = [0] * iterations
    for iteration in range(iterations):
        print(X)
        print(np.shape(X))
        hypothesis = X.dot(theta)
        loss = hypothesis - y
        gradient = X.T.dot(loss) / m
        theta = theta - alpha * gradient
        cost = cost_function(X, y, theta)
        cost_history[iteration] = cost
    return theta, cost_history
In [28]:
gd = gradient_descent(X,y,theta,alpha, iter)
In [16]:
print(theta)
[0 0]
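Note that gradient_descent returns the updated parameters rather than modifying theta in place, which is why print(theta) still shows [0 0]. A minimal sketch of how the optimized parameters and the cost history could be unpacked and inspected (the variable names below are illustrative):
In [ ]:
theta_opt, cost_history = gd              # tuple returned by gradient_descent above
print('Optimized theta :', theta_opt)
plt.figure(figsize=(10, 6))
plt.plot(range(iter), cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost J')
plt.show()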
In [17]:
best_fit_x = np.linspace(0, 25, 20)
best_fit_y = [theta[1] + theta[0]*xx for xx in best_fit_x]
plt.figure(figsize=(10,6))
plt.plot(X_df.population, y_df, '.')
plt.plot(best_fit_x, best_fit_y, '-')
plt.axis([0,25,-5,25])
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.title('Profit vs. Population with Linear Regression Line')
plt.show()
In [ ]:
In [ ]:
Find a dataset suitable for linear regression and apply the same algorithm to it. Print the optimized
parameters and the visualizations and attach them to your file. Also attach the code for this part in
your file.
In [18]:
data = pd.read_csv('../exam_result.csv')
In [19]:
data.head()
Out[19]:
SAT GPA
0 1714 2.40
1 1664 2.52
2 1760 2.54
3 1685 2.74
4 1693 2.83
In [20]:
X_df = pd.DataFrame(data.SAT)
y_df = pd.DataFrame(data.GPA)
m = len(y_df)
In [21]:
plt.figure(figsize=(10,8))
plt.plot(X_df, y_df, 'kx')
plt.xlabel('Score of SAT')
plt.ylabel('Obtained GPA')
Out[21]:
Text(0, 0.5, 'Obtained GPA')
In [22]:
iter = 1000
alpha = 0.01
In [23]:
X_df['intercept'] = 1
In [24]:
X = np.array(X_df)
y = np.array(y_df).flatten()
theta = np.array([0, 0])
In [25]:
cost_function(X, y, theta)
Out[25]:
5.581691666666667
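With raw SAT scores in the thousands, gradient descent with alpha = 0.01 can diverge. As an optional, assumption-labeled sketch (not part of the original lab), the SAT column could be mean-normalized before running the same routine:
In [ ]:
# Mean-normalize the SAT feature (column 0) so the updates stay numerically stable
X_scaled = X.astype(float).copy()
X_scaled[:, 0] = (X_scaled[:, 0] - X_scaled[:, 0].mean()) / X_scaled[:, 0].std()
theta_scaled, cost_hist = gradient_descent(X_scaled, y, np.array([0.0, 0.0]), alpha, iter)
print('Optimized theta on scaled data :', theta_scaled)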
In [29]:
gd = gradient_descent(X,y,theta,alpha, iter)
In [27]:
best_fit_x = np.linspace(0, 5000, 20)
best_fit_y = [theta[1] + theta[0]*xx for xx in best_fit_x]
plt.figure(figsize=(10,6))
plt.plot(X_df.SAT, y_df, '.')
plt.plot(best_fit_x, best_fit_y, '-')
plt.axis([0,5000,-1,4])
plt.xlabel('Score of SAT')
plt.ylabel('Obtained GPA')
plt.title('SAT Score vs. GPA')
plt.show()
In [ ]:
LAB 05 : Naive Bayes
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes.
This algorithm is mostly used in text classification and in problems having multiple classes.
In [1]:
from sklearn.naive_bayes import GaussianNB
import numpy as np
In [2]:
#assigning predictor and target variables
x = np.array([[-3,7], [1,5], [1,2], [-2,0], [2,3], [-4,0],
              [-1,1], [1,1], [-2,2], [2,7], [-4,1], [-2,7]])
Y = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])
In [8]:
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(x, Y)
Out[8]:
GaussianNB()
In [13]:
#Predict Output
predicted= model.predict([[1,2],[3,4]])
print (predicted)
[3 4]
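Since GaussianNB is probabilistic, the per-class probabilities behind these predictions can also be inspected; a small sketch using the fitted model above:
In [ ]:
print(model.classes_)                          # order of the classes in the columns below
print(model.predict_proba([[1, 2], [3, 4]]))   # probability of each class per test point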
In [ ]:
In [ ]:
Convert the "Play tennis" example discussed in class into numeric form and initialize the x and y
values based on that example.
Now run the code for the new x values as discussed in class and print the output.
Attach the code and output in your file.
In [18]:
# 0 - Overcast
# 1 - Sunny
# 2 - Rainy
X_data = np.array([[1,0], [0,1], [2,1], [1,1], [1,1], [0,1], [2,0], [2,0],
                   [1,1], [2,1], [1,0], [0,1], [0,1], [2,0]])
In [20]:
Y_data = np.array([0,0,1,1,1,0,1,0,1,1,1,1,1,0])
In [23]:
model = GaussianNB()
model.fit(X_data, Y_data)
Out[23]:
GaussianNB()
In [28]:
predicted= model.predict([[2,0],[2,1],[2,2]])
print (predicted)
[0 1 1]
In [ ]:
LAB 06 : Decision Tree Using Scikit Learn
In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
In [13]:
balance_data = pd.read_csv('../balance-scale.data',sep= ',',header=None)
balance_data.head()
Out[13]:
0 1 2 3 4
0 B 1 1 1 1
1 R 1 1 1 2
2 R 1 1 1 3
3 R 1 1 1 4
4 R 1 1 1 5
In [14]:
print("Dataset Lenght:: ", len(balance_data))
print("Dataset Shape:: ", balance_data.shape)
Dataset Lenght:: 625
Dataset Shape:: (625, 5)
In [15]:
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
In [19]:
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
print(clf_entropy)
DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5,
                       random_state=100)
In [20]:
y_pred_en = clf_entropy.predict(X_test)
print(y_pred_en)
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
In [21]:
print ("Accuracy is ", accuracy_score(y_test,y_pred_en)*100)
Accuracy is 70.74468085106383
In [22]:
with open("balanceScale.txt", "w") as f:
f = tree.export_graphviz(clf_entropy, out_file=f)
In [23]:
from IPython.display import Image
Image(filename='lab_06_1.PNG')
Out[23]:
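If Graphviz is not installed, the fitted tree can also be drawn directly with scikit-learn's matplotlib-based plot_tree (available in scikit-learn 0.21+); a minimal sketch using the clf_entropy fitted above:
In [ ]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(clf_entropy, filled=True)   # draws the same decision tree without Graphviz
plt.show()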
In [ ]:
Apply the same code to any other dataset from the UCI Machine Learning Repository and write
down the outputs (accuracy, the tree, and its visualization).
In [149]:
machine_data = pd.read_csv('../machine.data',header=None)
In [150]:
machine_data.head()
Out[150]:
0 1 2 3 4 5 6 7 8 9
0 adviser 32/60 125 256 6000 256 16 128 198 199
1 amdahl 470v/7 29 8000 32000 32 8 32 269 253
2 amdahl 470v/7a 29 8000 32000 32 8 32 220 253
3 amdahl 470v/7b 29 8000 32000 32 8 32 172 253
4 amdahl 470v/7c 29 8000 16000 32 8 16 132 132
In [151]:
print("Dataset Lenght:: ", len(machine_data))
print("Dataset Shape:: ", machine_data.shape)
Dataset Lenght:: 209
Dataset Shape:: (209, 10)
In [152]:
X = machine_data.values[:, 2:3]
Y = machine_data.values[:,0]
In [153]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
In [155]:
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
print(clf_entropy)
DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5,
                       random_state=100)
In [156]:
y_pred_en = clf_entropy.predict(X_test)
print(y_pred_en)
['nas' 'amdahl' 'ibm' 'harris' 'harris' 'nas' 'nas' 'nas' 'harris'
'amdahl' 'nas' 'nas' 'nas' 'nas' 'harris' 'amdahl' 'nas' 'harris' 'nas'
'nas' 'harris' 'burroughs' 'harris' 'amdahl' 'harris' 'nas' 'nas' 'nas'
'ibm' 'ibm' 'nas' 'harris' 'harris' 'nas' 'burroughs' 'nas' 'nas'
'amdahl' 'ibm' 'ibm' 'harris' 'amdahl' 'harris' 'honeywell' 'nas' 'nas'
'harris' 'honeywell' 'nas' 'nas' 'honeywell' 'harris' 'nas' 'harris'
'amdahl' 'nas' 'harris' 'harris' 'harris' 'burroughs' 'nas' 'harris'
'ibm']
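The task also asks for the accuracy on this dataset; it can be computed exactly as in the balance-scale example, reusing y_test and y_pred_en from the cells above:
In [ ]:
print("Accuracy is ", accuracy_score(y_test, y_pred_en)*100)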
In [157]:
with open("machine_data.txt", "w") as f:
f = tree.export_graphviz(clf_entropy, out_file=f)
In [158]:
Image(filename='lab_06_2.PNG')
Out[158]:
Lab 07 : Performance Metrics
In [12]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
In [13]:
X_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
Y_predic = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
In [14]:
results = confusion_matrix(X_actual, Y_predic)
print ('Confusion Matrix :')
print(results)
Confusion Matrix :
[[3 3]
[1 3]]
In [15]:
print ('Accuracy Score is',accuracy_score(X_actual, Y_predic))
print ('Classification Report : ')
print (classification_report(X_actual, Y_predic))
print('AUC-ROC:',roc_auc_score(X_actual, Y_predic))
print('LOGLOSS Value is',log_loss(X_actual, Y_predic))
Accuracy Score is 0.6
Classification Report :
precision recall f1-score support
0 0.75 0.50 0.60 6
1 0.50 0.75 0.60 4
accuracy 0.60 10
macro avg 0.62 0.62 0.60 10
weighted avg 0.65 0.60 0.60 10
AUC-ROC: 0.625
LOGLOSS Value is 13.815750437193334
In [ ]:
Why do we use performance metrics in machine learning?
Performance metrics are used to evaluate different machine learning algorithms.
Using performance metrics helps to justify the accuracy of your model/algorithm.
Task
We have a confusion matrix that indicates the number of cancer patients tested and how many actually tested positive.
Write the code in Python to calculate the classification accuracy and the classification report for the given data.
In [638]:
X_actual = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0]
In [639]:
Y_predic = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0,
1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0]
In [640]:
results = confusion_matrix(X_actual, Y_predic)
print ('Confusion Matrix :')
print(results)
Confusion Matrix :
[[ 50 10]
[ 5 100]]
In [641]:
print ('Accuracy Score is',accuracy_score(X_actual, Y_predic))
print ('Classification Report : ')
print (classification_report(X_actual, Y_predic))
Accuracy Score is 0.9090909090909091
Classification Report :
precision recall f1-score support
0 0.91 0.83 0.87 60
1 0.91 0.95 0.93 105
accuracy 0.91 165
macro avg 0.91 0.89 0.90 165
weighted avg 0.91 0.91 0.91 165
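As a sanity check (an addition, not part of the original lab), the same accuracy can be recovered directly from the confusion matrix counts stored in results above:
In [ ]:
# confusion_matrix for binary labels 0/1 is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = results.ravel()
print('Accuracy from counts :', (tn + tp) / (tn + fp + fn + tp))   # (50 + 100) / 165 ≈ 0.909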
LAB 09 : K-Means
In [19]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets.samples_generator import make_blobs
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
In [20]:
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
In [21]:
plt.scatter(X[:, 0], X[:, 1], s=20);
plt.show()
In [22]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [23]:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9)
plt.show()
In [ ]:
What is the importance of the k-means algorithm among the clustering algorithms of
machine learning?
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e.,
data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the
number of groups represented by the variable K.
Advantages of k-means
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples.
Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of k-means
Choosing k manually (one common guide, the elbow method, is sketched below).
Being dependent on initial values.
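Because k has to be chosen manually, the "elbow" of the within-cluster sum of squares (inertia) is a common guide; a hedged sketch (an addition, not part of the original lab), assuming the X generated by make_blobs above:
In [ ]:
# Elbow method: inertia for several candidate values of k
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(km.inertia_)
plt.plot(range(1, 11), inertias, 'o-')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()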
Write a code snippet in Python that implements the k-means algorithm on a dataset: create 10
clusters, calculate the centroids of the data, and visualize them.
In [40]:
X, y_true = make_blobs(n_samples=1000, centers=10, cluster_std=1.5, random_state=0)
In [47]:
plt.scatter(X[:, 0], X[:, 1], s=5);
plt.show()
In [48]:
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [52]:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=5, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=50, alpha=1)
plt.show()
Lab 10 : Hierarchical Clustering
In [15]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc
%matplotlib inline
In [16]:
data=pd.read_csv('../Wholesale customers data.csv')
data.head()
Out[16]:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 2 3 12669 9656 7561 214 2674 1338
1 2 3 7057 9810 9568 1762 3293 1776
2 2 3 6353 8808 7684 2405 3516 7844
3 1 3 13265 1196 4221 6404 507 1788
4 2 3 22615 5410 7198 3915 1777 5185
In [17]:
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
Out[17]:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 0.000112 0.000168 0.708333 0.539874 0.422741 0.011965 0.149505 0.074809
1 0.000125 0.000188 0.442198 0.614704 0.599540 0.110409 0.206342 0.111286
2 0.000125 0.000187 0.396552 0.549792 0.479632 0.150119 0.219467 0.489619
3 0.000065 0.000194 0.856837 0.077254 0.272650 0.413659 0.032749 0.115494
4 0.000079 0.000119 0.895416 0.214203 0.284997 0.155010 0.070358 0.205294
In [18]:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
In [19]:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
Out[19]:
<matplotlib.lines.Line2D at 0x233621b01c0>
In [20]:
cluster = AgglomerativeClustering(n_clusters=2,affinity='euclidean',linkage='ward')
cluster.fit_predict(data_scaled)
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
Out[20]:
<matplotlib.collections.PathCollection at 0x23362332850>
In [ ]:
What importance does hierarchical clustering have over other
algorithms?
The advantage of hierarchical clustering is that it is easy to understand and implement. The dendrogram
output of the algorithm can be used to understand the big picture as well as the groups in your data.
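Cutting the dendrogram at the threshold drawn above (y = 6) can also be done programmatically; a minimal sketch, assuming data_scaled and the shc import from the cells above (the variable names here are illustrative):
In [ ]:
# Cut the ward linkage at distance 6, as drawn on the dendrogram above
Z = shc.linkage(data_scaled, method='ward')
labels = shc.fcluster(Z, t=6, criterion='distance')
print(pd.Series(labels).value_counts())   # size of each resulting cluster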
LAB 12 : PCA
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
# import warnings
# warnings.filterwarnings("ignore")
In [3]:
m_data = pd.read_csv('../mushrooms.csv')
In [4]:
m_data.head()
Out[4]:
  class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color ...
0     p         x           s         n       t    p               f            c         n          k ...
1     e         x           s         y       t    a               f            c         b          k ...
2     e         b           s         w       t    l               f            c         b          n ...
3     p         x           y         w       t    p               f            c         n          n ...
4     e         x           s         g       f    n               f            w         b          k ...
5 rows × 23 columns
In [5]:
encoder = LabelEncoder()
# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])
X_features = m_data.iloc[:,1:23]
y_label = m_data.iloc[:, 0]
In [6]:
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
In [7]:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_
plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
In [8]:
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()
In [ ]:
What is the difference between supervised and
unsupervised dimensionality reduction analysis?
In a supervised learning model, the algorithm learns on a labeled dataset, providing an answer key that the
algorithm can use to evaluate its accuracy on training data. An unsupervised model, in contrast, is given
unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own.
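Unsupervised component selection is often done by inspecting the cumulative explained variance rather than any label-based criterion; a short sketch, assuming the fitted pca object and X_features from the cells above:
In [ ]:
import numpy as np
cum_var = np.cumsum(pca.explained_variance_ratio_)   # cumulative fraction of variance retained
print(cum_var)
print('Components needed for 95% variance :', np.argmax(cum_var >= 0.95) + 1)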
In [ ]:
Write a PCA implementation over the dataset at the following link:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
and attach it in your lab file.
In [11]:
m_data = pd.read_csv('../breast-cancer-wisconsin.data',header=None)
In [12]:
m_data.head()
Out[12]:
0 1 2 3 4 5 6 7 8 9 10
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
In [13]:
encoder = LabelEncoder()
# Now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])
X_features = m_data.iloc[:,1:23]
y_label = m_data.iloc[:, 0]
In [16]:
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
In [24]:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_
plt.figure(figsize=(8, 6))
plt.bar(range(10), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
In [33]:
pca2 = PCA(n_components=10)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data[0])
plt.show()