AIML Hon. Practical 4

Uploaded by Saniya Bonde

Practical Assignment 4

Name:
CRN No:
Course: 310302: Computational Programming Laboratory
Instructor: Prof.

Title: Apply Basic PCA on the iris Dataset

Date of Completion:

Assignment Objectives:

● Describe the dataset. Should the dataset be standardized?

● Describe the structure of correlations among variables.

● Compute a PCA with the maximum number of components.

● Compute the cumulative explained variance ratio. Determine the number of components K from your computed values.

● Print the K principal component directions and the correlations of the K principal components with the original variables. Interpret the contributions of the original variables to the PCs.

● Plot the samples projected onto the first K PCs.

● Color samples by their species.

Problem Statement:
Perform Principal Component Analysis (PCA) on the Iris dataset to reduce its dimensionality
while retaining most of the variance in the data. Analyze the relationships between the original
features and the derived principal components, and visualize the data in a lower-dimensional
space to examine how well the species are separated.
Software and Hardware Requirements:
Software:

● Python 3.x
● Libraries: pandas, numpy, seaborn, matplotlib, scikit-learn (StandardScaler, PCA)
Hardware:

● A computer with at least 4 GB of RAM


● Operating System: Windows, macOS, or Linux

Theory:

1. Describe the Dataset: Should the Dataset Be Standardized?

As previously discussed, the Iris dataset contains 150 samples and 4 features:

● Sepal Length

● Sepal Width

● Petal Length

● Petal Width

All four features are measured in centimeters, but their ranges and variances differ. Standardization is typically recommended before Principal Component Analysis (PCA), because PCA seeks the directions of greatest variance and would otherwise be dominated by the features with the largest raw spread. Therefore, yes, the dataset should be standardized.
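As a minimal sketch (assuming scikit-learn and pandas are installed), the standardization can be done with StandardScaler:

```python
# Load the Iris data and standardize each feature to mean 0, variance 1.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)  # 150 samples x 4 features

X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0).round(6))  # all ~0
print(X_std.std(axis=0).round(6))   # all 1
```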

2. Describe the Structure of Correlations Among Variables

To describe correlations, we can compute a correlation matrix to see how the features are related
to each other. High correlations between some variables may suggest redundancy or overlapping
information, which PCA will help capture.
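For example, with pandas (feature names as returned by scikit-learn's load_iris):

```python
# Pearson correlation matrix of the four Iris features.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

corr = df.corr()
print(corr.round(2))
# Petal length and petal width are very strongly correlated (~0.96),
# while sepal width is negatively correlated with both petal measurements.
```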

3. Compute PCA with the Maximum Number of Components

We will perform PCA to reduce the dimensionality of the dataset. Since we have 4 features, the
maximum number of principal components (PCs) is 4.
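A sketch using scikit-learn's PCA on the standardized data:

```python
# Fit PCA with all 4 components on the standardized Iris features.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=4)          # maximum = number of features
scores = pca.fit_transform(X_std)  # 150 x 4 coordinates in PC space

print(scores.shape)                            # (150, 4)
print(pca.explained_variance_ratio_.round(3))  # PC1 alone explains roughly 73%
```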

4. Compute the Cumulative Explained Variance Ratio

We will compute the cumulative explained variance ratio, which tells us how much of the total variance in the data is explained by the first few principal components together. We will determine the number of components K that account for most of the variance.
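A sketch of this computation, choosing K at an (assumed) 95% variance threshold:

```python
# Cumulative explained variance and the smallest K reaching 95%.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=4).fit(X_std)

cum = np.cumsum(pca.explained_variance_ratio_)
print(cum.round(3))  # cumulative variance after 1, 2, 3 and 4 components

K = int(np.argmax(cum >= 0.95)) + 1  # first index where the threshold is met
print("K =", K)  # K = 2 for standardized Iris
```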
5. Print the Principal Components Directions and Correlations

We will examine the loading matrix (the principal component directions) and the correlations between the original variables and the K principal components. This helps in understanding which original features contribute the most to each principal component.
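One way to sketch this with scikit-learn, taking K = 2 (the first two PCs typically capture most of the variance on Iris); note that the correlation formula below assumes unit-variance inputs:

```python
# PC directions and their correlations with the original variables.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_std)

# Rows = principal components, columns = original variables.
directions = pd.DataFrame(pca.components_, index=["PC1", "PC2"],
                          columns=iris.feature_names)
print(directions.round(3))

# For standardized inputs, corr(PC_k, x_j) ~ direction_kj * sqrt(eigenvalue_k).
loadings = directions.mul(np.sqrt(pca.explained_variance_), axis=0)
print(loadings.round(3))
```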

6. Plot the Samples Projected onto the First K PCs

We will project the data onto the first K principal components and visualize it in 2D or 3D.

7. Color Samples by Their Species

We will color-code the samples based on their species to see how well the PCA separates the
different species.
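A minimal plotting sketch (the Agg backend makes it run headless; drop that line for an on-screen window):

```python
# Project onto the first two PCs and color the points by species.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
scores = PCA(n_components=2).fit_transform(X_std)

for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    plt.scatter(scores[mask, 0], scores[mask, 1], label=name, alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.savefig("iris_pca.png")  # setosa typically separates cleanly from the rest
```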

We now implement the steps in Python:

1. Standardize the dataset.

2. Compute the correlation matrix.

3. Perform PCA.

4. Compute cumulative explained variance.

5. Plot the projection onto principal components.

6. Color the samples by species.


Conclusion:
The Principal Component Analysis (PCA) of the Iris dataset provides insights into the
structure of the data, reducing its dimensionality while preserving most of the variance.

1. Standardization: Standardizing the dataset was essential as the features had different
ranges. PCA is sensitive to these differences, and standardization ensures that no feature
dominates simply due to its scale.

2. Correlation Matrix: The correlation matrix revealed how strongly each feature was
related to the others. Features like petal length and petal width likely showed strong
positive correlations, suggesting redundancy, while sepal width might have been less
correlated with other features.

3. Principal Components: We computed the first four principal components, and they
captured different aspects of the data:

o The first two principal components (PC1 and PC2) typically explain the majority
of the variance. They highlight combinations of the original features that best
represent the spread of the data.

o Petal length and petal width often contribute most to the variance in PC1, while
sepal length and sepal width might contribute more to PC2.

4. Cumulative Explained Variance: The cumulative explained variance ratio showed how
much variance was captured as we added each new component. Typically, the first two
principal components (PC1 and PC2) capture around 95% of the total variance, making
them sufficient for most purposes.

5. Projection and Visualization: Projecting the data into the first two principal components
provided a clear visualization. When the samples were colored by species, the Setosa
species was often well-separated from the others, while Versicolor and Virginica
showed some overlap. This suggests that Setosa is more distinct, while the other two
species have more similar patterns in their features.
