
Experiment 3

Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.

Introduction to Principal Component Analysis (PCA)

What is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform a high-dimensional dataset into a lower-dimensional space while retaining as much variance as possible. It is an unsupervised learning method commonly used in machine learning and data visualization.

Importance of PCA
Reduces computational complexity by lowering the number of features.
Helps in visualizing high-dimensional data.
Removes redundant or correlated features, improving model performance.
Reduces overfitting by eliminating noise in the data.

How Does PCA Work?

PCA follows these key steps (a compact code sketch follows the list):

1. Standardization: The data is normalized so that all features have a mean of zero and a standard deviation of one.
2. Compute the Covariance Matrix: This step helps in understanding how different features relate to each other.
3. Eigenvalue & Eigenvector Calculation: Eigenvectors represent the directions of the new feature axes, and eigenvalues determine the importance of these axes.
4. Selecting Principal Components: The eigenvectors corresponding to the highest eigenvalues are chosen to form the new feature space.
5. Transforming Data: The original dataset is projected onto the new feature space with reduced dimensions.
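
Below is a minimal NumPy-only sketch of these five steps. The variable names and the use of np.linalg.eigh are illustrative choices; the complete program later in this experiment uses scikit-learn's PCA instead.

import numpy as np
from sklearn import datasets

X = datasets.load_iris().data                      # 150 x 4 feature matrix

# Step 1: standardize (zero mean, unit variance for every feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features (4 x 4)
cov = np.cov(X_std.T)

# Step 3: eigen decomposition (eigh is suitable because the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: keep the two eigenvectors with the largest eigenvalues
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]                          # 4 x 2 projection matrix

# Step 5: project the data onto the new 2D feature space
X_2d = X_std @ W
print(X_2d.shape)                                  # (150, 2)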

Applying PCA to the Iris Dataset


The Iris dataset consists of 4 numerical features (sepal length, sepal width,
petal length, petal width) used to classify flowers into 3 species (Setosa,
Versicolor, and Virginica).

Goal: Reduce the 4-dimensional feature space to 2 principal components while retaining most of the variance.
Benefit: Enables 2D visualization of the dataset, making it easier to interpret classification results.

Understanding PCA Output

1. Variance Explained by Each Principal Component
PCA provides explained variance ratios, which indicate how much information each principal component retains.

If PC1 explains 70% and PC2 explains 20%, then the first two principal
components capture 90% of the variance in the dataset.
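
As a short sketch (assuming scikit-learn is available), the ratios can be read from a fitted PCA object and accumulated with np.cumsum; the actual values for the Iris dataset are printed by the full program below.

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA(n_components=2).fit(X)

print("Per-component ratio:", pca.explained_variance_ratio_)
print("Cumulative variance:", np.cumsum(pca.explained_variance_ratio_))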

2. Scatter Plot of PCA-Reduced Data


A 2D scatter plot of PCA-transformed features allows us to visualize how well
PCA separates different species in the Iris dataset.

3. Impact of PCA on Classification

If PCA preserves most of the variance, classification algorithms (e.g., k-NN, SVM) can achieve similar performance with fewer features; a comparison sketch follows below.
If too much information is lost, classification accuracy may decrease.
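
The sketch below compares a k-NN classifier on the original 4 standardized features against the same classifier on the 2 PCA components. The 70/30 train/test split, random_state=0, and k=5 are illustrative assumptions, not part of the experiment above.

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(X, iris.target, test_size=0.3, random_state=0)

# Baseline: all 4 standardized features
knn_full = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc_full = accuracy_score(y_test, knn_full.predict(X_test))

# PCA fitted on the training data only, then applied to the test data
pca = PCA(n_components=2).fit(X_train)
knn_pca = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(X_train), y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(pca.transform(X_test)))

print(f"Accuracy with 4 features: {acc_full:.2f}")
print(f"Accuracy with 2 PCA components: {acc_pca:.2f}")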

Benefits of PCA
Feature Reduction: Reduces the number of variables without significant loss of information.
Noise Reduction: Removes redundant or less informative features.
Improved Visualization: Enables easier interpretation of high-dimensional data.
Better Model Performance: Enhances efficiency in training machine learning models.

In [5]: # Introduction to the Iris Dataset

# The Iris dataset is one of the most well-known datasets in machine learning and statistics.
# It contains 150 samples of iris flowers categorized into three species: Setosa, Versicolor, and Virginica.
#
# The goal of using PCA in this exercise is to reduce these four features into two principal components.
# This will help in visualizing the data better and understanding its underlying structure.
#
# Since humans struggle to visualize data in more than three dimensions, PCA helps
# retain the most important patterns while making it easier to interpret,
# preserving as much variance as possible.

Explanation of Features in the Iris Dataset

The Iris dataset consists of 4 features, which represent different physical characteristics of iris flowers:

Sepal Length (cm)
Sepal Width (cm)
Petal Length (cm)
Petal Width (cm)

These features were chosen because they effectively differentiate between the three iris species (Setosa, Versicolor, and Virginica).
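
For reference, a small sketch (using scikit-learn's bundled copy of the dataset) that prints the feature and species names:

from sklearn import datasets

iris = datasets.load_iris()
print(iris.feature_names)   # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']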

In the 3D visualizations, we select three features for plotting, which are:

Feature 1 → Sepal Length
Feature 2 → Sepal Width
Feature 3 → Petal Length

These features are chosen arbitrarily for visualization, but all four features are used in the PCA computation.

Why is the Iris Dataset Important?

The Iris dataset is a benchmark dataset in machine learning because:

It is small yet diverse, making it easy to analyze.
It has clearly separable classes, which makes it ideal for classification tasks.
It is preloaded in Scikit-learn, making it accessible for learning and experimentation.

Since the dataset contains three classes (Setosa, Versicolor, and Virginica), PCA
helps visualize how well the classes can be separated in a lower-dimensional
space.

In [6]: import numpy as np


import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Load the Iris Dataset


iris = datasets.load_iris()
X = iris.data    # Extracting feature matrix (4D data)
y = iris.target  # Extracting labels (0, 1, 2 representing three iris species)

# Step 2: Standardizing the Data
# PCA works best when data is standardized (mean = 0, variance = 1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Calculating Covariance Matrix and Eigenvalues/Eigenvectors


# The foundation of PCA is eigen decomposition of the covariance matrix
cov_matrix = np.cov(X_scaled.T)
print(cov_matrix)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

# Step 4: Visualizing Data in 3D before PCA


fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
colors = ['red', 'green', 'blue']
labels = iris.target_names
for i in range(len(colors)):
    ax.scatter(X_scaled[y == i, 0], X_scaled[y == i, 1], X_scaled[y == i, 2],
               color=colors[i], label=labels[i])
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Petal Length')
ax.set_title('3D Visualization of Iris Data Before PCA')
plt.legend()
plt.show()

# Step 5: Applying PCA using SVD (Singular Value Decomposition)


# PCA internally relies on SVD, which decomposes a matrix into three parts: U, S, and Vt
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
print("Singular Values:", S)

# Step 6: Applying PCA to Reduce Dimensionality to 2D


# We reduce 4D data to 2D for visualization while retaining maximum variance
pca = PCA(n_components=2)            # We choose 2 components because we want to visualize the data in 2D
X_pca = pca.fit_transform(X_scaled)  # Transform data into principal components

# Step 7: Understanding Variance Explained


# PCA provides the percentage of variance retained in each principal component
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance by PC1: {explained_variance[0]:.2f}")
print(f"Explained Variance by PC2: {explained_variance[1]:.2f}")

# Step 8: Visualizing the Transformed Data


# We plot the 2D representation of the Iris dataset after PCA transformation
plt.figure(figsize=(8, 6))
for i in range(len(colors)):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=colors[i],
                label=labels[i])

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset (Dimensionality Reduction)')
plt.legend()
plt.grid()
plt.show()
# Step 9: Visualizing Eigenvectors Superimposed on 3D Data
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
for i in range(len(colors)):
    ax.scatter(X_scaled[y == i, 0], X_scaled[y == i, 1], X_scaled[y == i, 2],
               color=colors[i], label=labels[i])
for i in range(3):  # Plot the first three eigenvectors (columns of the eigenvector matrix)
    ax.quiver(0, 0, 0, eigenvectors[0, i], eigenvectors[1, i], eigenvectors[2, i],
              color='black')
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Petal Length')
ax.set_title('3D Data with Eigenvectors')
plt.legend()
plt.show()

# Recap:
# - The Iris dataset is historically important for testing classification models.
# - We standardized the data to ensure fair comparison across features.
# - We calculated the covariance matrix, eigenvalues, and eigenvectors.
# - PCA is built on SVD, which decomposes data into important components.
# - We visualized the original 3D data and superimposed eigenvectors.
[[ 1.00671141 -0.11835884  0.87760447  0.82343066]
 [-0.11835884  1.00671141 -0.43131554 -0.36858315]
 [ 0.87760447 -0.43131554  1.00671141  0.96932762]
 [ 0.82343066 -0.36858315  0.96932762  1.00671141]]
Eigenvalues: [2.93808505 0.9201649  0.14774182 0.02085386]
Eigenvectors:
 [[ 0.52106591 -0.37741762 -0.71956635  0.26128628]
 [-0.26934744 -0.92329566  0.24438178 -0.12350962]
 [ 0.5804131  -0.02449161  0.14212637 -0.80144925]
 [ 0.56485654 -0.06694199  0.63427274  0.52359713]]
Singular Values: [20.92306556 11.7091661   4.69185798  1.76273239]
Explained Variance by PC1: 0.73
Explained Variance by PC2: 0.23
