Department of Artificial Intelligence and Data Science
Computer Laboratory-I
Assignment No:1-A
Title: To use PCA Algorithm for dimensionality reduction.
You have a dataset that includes measurements of different variables of wine (alcohol,
ash, magnesium, and so on). Apply the PCA algorithm to transform this data so that most
of the variation in the measurements of the variables is captured by a small number of
principal components, making it easier to distinguish between red and white wine by
inspecting these principal components.
Dataset Link: https://media.geeksforgeeks.org/wp-content/uploads/Wine.csv
Objectives: To make use of the PCA algorithm
To transform the data into a reduced form
Theory:
Principal Component Analysis is an unsupervised learning algorithm that is used for
dimensionality reduction in machine learning. It is a statistical procedure that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of an
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools used for exploratory data analysis and predictive
modeling. It is a technique for drawing out strong patterns from the given dataset while reducing
its dimensionality. The PCA algorithm is based on mathematical concepts such as:
○ Variance and Covariance
○ Eigenvalues and Eigenvectors
Some common terms used in PCA algorithm:
○ Dimensionality: It is the number of features or variables present in the given dataset.
More easily, it is the number of columns present in the dataset.
○ Correlation: It signifies how strongly two variables are related to each other; if one
changes, the other also changes. The correlation value ranges from -1 to +1, where -1
indicates that the variables are inversely proportional to each other and +1 indicates that
they are directly proportional to each other.
○ Orthogonal: It means that the variables are not correlated with each other, so the
correlation between each pair of variables is zero.
○ Eigenvectors: If M is a square matrix and v is a non-zero vector, then v is an eigenvector
of M if Mv is a scalar multiple of v.
○ Covariance Matrix: A matrix containing the covariance between each pair of variables is
called the Covariance Matrix. (A small NumPy illustration of these terms follows this list.)
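As a quick illustration of these terms, the covariance matrix and its eigenvalues/eigenvectors can be computed directly with NumPy. This is only a minimal sketch; the tiny two-feature array is made up purely for demonstration and is not part of the wine dataset:

import numpy as np

# Two correlated features, five observations (made-up numbers for illustration only)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Covariance matrix of the column-wise features
cov = np.cov(X, rowvar=False)

# eigh is used because the covariance matrix is symmetric;
# eigenvalues give the variance along each (orthogonal) eigenvector direction
eig_vals, eig_vecs = np.linalg.eigh(cov)
print(cov)       # 2x2 covariance matrix
print(eig_vals)  # variances along the principal directions
print(eig_vecs)  # orthogonal, uncorrelated directions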
Steps for PCA algorithm
1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where
X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset in a structured form, namely a two-dimensional matrix
of the independent variables X. Here each row corresponds to a data item, and each
column corresponds to a feature. The number of columns is the dimensionality of the
dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Without standardization, the features with
high variance dominate the features with lower variance. If the importance of a feature
should not depend on its scale, we subtract the mean of each column from its entries and
divide each data item in a column by the standard deviation of the column. The resulting
matrix is named Z.
4. Calculating the Covariance of Z
To calculate the covariance matrix of Z, we take the matrix Z, transpose it, and multiply
the transpose by Z (scaled by the number of observations minus one). The output matrix
is the covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance
matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes with
high information, and the corresponding eigenvalues give the amount of variance
captured along each of those directions.
6. Sorting the Eigen Vectors
In this step, we take all the eigenvalues and sort them in decreasing order, i.e., from
largest to smallest, and simultaneously sort the corresponding eigenvectors into a matrix
in the same order. The resulting matrix of sorted eigenvectors is named P*.
7. Calculating the new features or Principal Components
Here we calculate the new features. To do this, we multiply the matrix Z by P*. In the
resulting matrix Z*, each observation is a linear combination of the original features, and
the columns of Z* are uncorrelated with each other.
8. Remove less or unimportant features from the new dataset.
With the new feature set in hand, we decide what to keep and what to remove: only the
relevant or important components (those with the largest eigenvalues) are kept in the new
dataset, and the unimportant ones are removed. A minimal NumPy sketch of steps 3-8 is
given after this list.
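The steps above can be condensed into a short NumPy sketch. This is only an illustrative from-scratch implementation, assuming X is a NumPy array of shape (n_samples, n_features); it is not the scikit-learn code used later in the assignment:

import numpy as np

def pca_from_scratch(X, n_components=2):
    # Step 3: standardize each column (zero mean, unit variance)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 4: covariance matrix of the standardized data
    cov = np.cov(Z, rowvar=False)

    # Step 5: eigenvalues and eigenvectors (eigh, since the covariance matrix is symmetric)
    eig_vals, eig_vecs = np.linalg.eigh(cov)

    # Step 6: sort eigenvectors by decreasing eigenvalue
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

    # Steps 7-8: keep the top n_components eigenvectors (P*) and project Z onto them
    P_star = eig_vecs[:, :n_components]
    Z_star = Z @ P_star
    return Z_star, eig_vals

# Example usage with random data standing in for the 13 wine measurements
X = np.random.rand(100, 13)
Z_star, eig_vals = pca_from_scratch(X, n_components=2)
print(Z_star.shape)  # (100, 2)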
Sample Code
In[1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking Run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/wineuci/Wine.csv
In [2]:
#------------------Import_libraries------------------
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [3]:
df = pd.read_csv("/kaggle/input/wineuci/Wine.csv")
In [4]:
#--------------print_sample_of_dataset------------------
df.head()
Out[4]:
   1  14.23  1.71  2.43  15.6  127   2.8  3.06   .28  2.29  5.64  1.04  3.92  1065
0  1  13.20  1.78  2.14  11.2  100  2.65  2.76  0.26  1.28  4.38  1.05  3.40  1050
1  1  13.16  2.36  2.67  18.6  101  2.80  3.24  0.30  2.81  5.68  1.03  3.17  1185
2  1  14.37  1.95  2.50  16.8  113  3.85  3.49  0.24  2.18  7.80  0.86  3.45  1480
3  1  13.24  2.59  2.87  21.0  118  2.80  2.69  0.39  1.82  4.32  1.04  2.93   735
4  1  14.20  1.76  2.45  15.2  112  3.27  3.39  0.34  1.97  6.75  1.05  2.85  1450
Note that the CSV file has no header row, so pandas has used the first wine sample's values (1, 14.23, 1.71, ..., 1065) as the column names. This is why the class column is later referred to as df['1'] and why only 177 rows appear in the DataFrame.
In [5]:
#---------------Check_dataset_information--------------
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 177 non-null int64
1 14.23 177 non-null float64
2 1.71 177 non-null float64
3 2.43 177 non-null float64
4 15.6 177 non-null float64
5 127 177 non-null int64
6 2.8 177 non-null float64
7 3.06 177 non-null float64
8 .28 177 non-null float64
9 2.29 177 non-null float64
10 5.64 177 non-null float64
11 1.04 177 non-null float64
12 3.92 177 non-null float64
13 1065 177 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB
In [6]:
#---------------Check_distribution_of_dataset----------------------
df.describe()
Out[6]:
                1       14.23        1.71        2.43        15.6         127         2.8
count  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000
mean     1.943503   12.993672    2.339887    2.366158   19.516949   99.587571    2.292260
std      0.773991    0.808808    1.119314    0.275080    3.336071   14.174018    0.626465
min      1.000000   11.030000    0.740000    1.360000   10.600000   70.000000    0.980000
25%      1.000000   12.360000    1.600000    2.210000   17.200000   88.000000    1.740000
50%      2.000000   13.050000    1.870000    2.360000   19.500000   98.000000    2.350000
75%      3.000000   13.670000    3.100000    2.560000   21.500000  107.000000    2.800000
max      3.000000   14.830000    5.800000    3.230000   30.000000  162.000000    3.880000

             3.06         .28        2.29        5.64        1.04        3.92         1065
count  177.000000  177.000000  177.000000  177.000000  177.000000  177.000000   177.000000
mean     2.023446    0.362316    1.586949    5.054802    0.956983    2.604294   745.096045
std      0.998658    0.124653    0.571545    2.324446    0.229135    0.705103   314.884046
min      0.340000    0.130000    0.410000    1.280000    0.480000    1.270000   278.000000
25%      1.200000    0.270000    1.250000    3.210000    0.780000    1.930000   500.000000
50%      2.130000    0.340000    1.550000    4.680000    0.960000    2.780000   672.000000
75%      2.860000    0.440000    1.950000    6.200000    1.120000    3.170000   985.000000
max      5.080000    0.660000    3.580000   13.000000    1.710000    4.000000  1680.000000
In [7]:
#-----------------Check_null_values_in_dataset--------------------
df.isnull().sum()
Out[7]:
1 0
14.23 0
1.71 0
2.43 0
15.6 0
127 0
2.8 0
3.06 0
.28 0
2.29 0
5.64 0
1.04 0
3.92 0
1065 0
dtype: int64
In [8]:
#-------------Check_imbalance_in_dataset--------------------
sns.countplot(x = '1',data=df)
Out[8]:
<AxesSubplot:xlabel='1', ylabel='count'>
In [9]:
target = df['1']
df = df.drop('1',axis=1)
In [10]:
#-----------Split_dataset_into_train_test_set--------------
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df,target,test_size =0.20,random_state=42)
In [11]:
sns.pairplot(X_train)
Out[11]:
<seaborn.axisgrid.PairGrid at 0x7ff464bfd610>
In [12]:
#------------Implement_scaling-----------
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [13]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
In [14]:
sns.pairplot(X_train)
Out[14]:
<seaborn.axisgrid.PairGrid at 0x7ff44f4dab10>
In [15]:
#-----------------Build_classifier_model_using_all_available_variables------
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train,y_train)
model
Out[15]:
LogisticRegression()
In [16]:
#--------Check_model_performance-------------------
from sklearn.metrics import classification_report
print("The classification_report
is:{}".format(classification_report(y_test,model.predict(X_test))))
The classification_report is: precision recall f1-score support
1 1.00 1.00 1.00 14
2 1.00 0.71 0.83 14
3 0.67 1.00 0.80 8
accuracy 0.89 36
macro avg 0.89 0.90 0.88 36
weighted avg 0.93 0.89 0.89 36
In [17]:
#-----------------Check_correlation_between_independent_variables---------------
plt.figure(figsize =(10,8))
sns.heatmap(X_train.corr(),annot=True)
Out[17]:
<AxesSubplot:>
In [18]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
tr_comp = pca.fit_transform(X_train)
ts_comp = pca.transform(X_test)
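To check how much of the total variance the two extracted components retain, a quick additional check (not in the original notebook) can use the explained_variance_ratio_ attribute of the fitted PCA object from the cell above; the exact numbers will depend on the train/test split:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Cumulative variance retained by the two components together
print(pca.explained_variance_ratio_.sum())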
In [19]:
#--------------Plot_PCA-----------------------
sns.scatterplot(x=tr_comp[:,0], y=tr_comp[:,1])
plt.xlabel("PC1")
plt.ylabel("PC2")
Out[19]:
Text(0, 0.5, 'PC2')
The components look orthogonal to each other.
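Colouring the same scatter plot by the class label makes the separation between the wine classes easier to see. The following is an optional sketch (not part of the original notebook) that reuses tr_comp and y_train from the cells above:

# Scatter of the first two principal components, coloured by wine class
sns.scatterplot(x=tr_comp[:, 0], y=tr_comp[:, 1], hue=y_train)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Training data projected onto the first two principal components")
plt.show()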
In [20]:
#---------------Build_ml_model_on_extracted_components---------------
from sklearn.linear_model import LogisticRegression
pc_model = LogisticRegression()
pc_model.fit(tr_comp,y_train)
pc_model
Out[20]:
LogisticRegression()
In [21]:
#------------Evaluate_model_performance---------------
from sklearn.metrics import classification_report
print("The classification report is:
{}".format(classification_report(y_test,pc_model.predict(ts_comp))))
The classification report is: precision recall f1-score support
1 1.00 1.00 1.00 14
2 1.00 0.93 0.96 14
3 0.89 1.00 0.94 8
accuracy 0.97 36
macro avg 0.96 0.98 0.97 36
weighted avg 0.98 0.97 0.97 36
The performance of the logistic regression model improved after performing principal
component analysis. PCA removed redundancy among the correlated features while retaining
most of the variance in the dataset.
Conclusion:
Students will be able to analyze the importance of PCA in dimensionality reduction.
Reference:
https://www.kaggle.com/code/bhavesh302/pca-on-wine-dataset
ASSIGNMENT QUESTION: 1
Q1. What is PCA, and how does it work in machine learning?
Q2. How is PCA used for dimensionality reduction, and why is it important?
Q3. When is PCA typically used, and what are some scenarios where it might not be
suitable?
Q4. What is LDA, and how does it differ from PCA?
Q5. Can PCA and LDA be used together in a machine learning pipeline, and if so, how?
Q6. What are some common use cases or examples where PCA and LDA have been
successfully applied in machine learning?