Module 4

Dimensionality Reduction

 Dimensionality reduction is a technique in data processing that reduces the
number of input variables in a dataset while preserving as much relevant
information as possible.
 This is crucial for high-dimensional data, which often has redundancies
and irrelevant features, leading to complex models and slower
computation.
 Eg:
 Image processing
 Time series analysis
 Automatic text analysis
DR
 Two main approaches in dimensionality reduction are:
1. Feature Selection: Identifying and keeping only the most relevant features and
discarding others based on criteria like correlation, mutual information, etc.
   In feature selection, we are interested in finding k of the total of n features that
   give us the most information and we discard the other (n−k) dimensions. We are
   going to discuss subset selection as a feature selection method.

2. Feature Extraction: Creating new features from existing ones, capturing the
essential information but in a reduced number of dimensions.
Feature Extraction
 In feature extraction, we are interested in finding a new set of k features that are
combinations of the original n features.
 These methods may be supervised or unsupervised depending on whether or not
they use the output information.
 The best known and most widely used feature extraction methods are Principal
Components Analysis (PCA) and Linear Discriminant Analysis (LDA).
 Principal Component Analysis (PCA): Projects data into a lower-
dimensional space where variance is maximized, preserving as much of
the data's spread as possible.
 Linear Discriminant Analysis (LDA): A supervised method that projects
data based on maximizing class separability, useful when you have
labeled data.
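As a quick illustration of the difference, here is a minimal sketch assuming scikit-learn
and its built-in Iris dataset (neither is part of these notes); both methods reduce the
same data to two dimensions:

    # Sketch comparing PCA (unsupervised) and LDA (supervised) on the Iris data.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)

    # PCA ignores the labels and keeps the directions of largest variance.
    X_pca = PCA(n_components=2).fit_transform(X)

    # LDA uses the labels and keeps the directions that best separate the classes.
    X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

    print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)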
Linear Discriminant Analysis

 Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis
or Discriminant Function Analysis, is a dimensionality reduction technique primarily
utilized in supervised classification problems.
 It facilitates the modeling of distinctions between groups, effectively
separating two or more classes.
 LDA operates by projecting features from a higher-dimensional space
into a lower-dimensional one.
Assumptions of LDA
 LDA assumes that the data has a Gaussian distribution and
that the covariance matrices of the different classes are equal.
 It also assumes that the data is linearly separable, meaning
that a linear decision boundary can accurately classify the
different classes.
 Consider a case where there is no straight line that can separate the two classes
of data points completely.
 In this case, LDA (Linear Discriminant Analysis) is used, which reduces the 2D
graph to a 1D graph in order to maximize the separability between the two classes.
 Two criteria are used by LDA to create a new axis:
 Maximize the distance between the means of the two classes.
 Minimize the variation within each class.
 But Linear Discriminant Analysis fails when the means of the distributions are
shared, as it becomes impossible for LDA to find a new axis that makes both
classes linearly separable.
Extensions to LDA

 Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
 Flexible Discriminant Analysis (FDA): Where non-linear combinations
of inputs are used such as splines.
 Regularized Discriminant Analysis (RDA): Introduces regularization
into the estimate of the variance (actually covariance), moderating the
influence of different variables on LDA.
 Consider datapoints with two class means, µ1 and µ2, the mean of the entire dataset
(µ), and the covariance of each class with itself.

 The covariance matrix can tell us about the scatter within a dataset, which is the
amount of spread that there is within the data.

 The way to find the within-class scatter is to multiply the covariance of each class by
pc, the probability of the class (that is, the number of datapoints there are in that class
divided by the total number).

 Within-class scatter of the dataset: SW = Σc pc cov(c)

 We also want the distance between the classes to be large. This is known as the
between-classes scatter: SB = Σc (µc − µ)(µc − µ)^T

 The classes are discriminable when the ratio SB/SW is as large as possible.

 The projection of the data can be written as z = w^T x, where w is the weight vector.
For two classes, the criterion is maximized by w ∝ SW⁻¹(µ1 − µ2).

Two different possible projection lines. The one on the left fails to separate the classes.
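A minimal NumPy sketch of the two-class case above, on made-up toy data; the
closed-form direction w ∝ SW⁻¹(µ1 − µ2) is the standard Fisher solution:

    # Two-class Fisher/LDA direction from the scatter matrices described above.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),      # class 0
                   rng.normal([3, 2], 1.0, (50, 2))])     # class 1
    y = np.array([0] * 50 + [1] * 50)

    mu = [X[y == c].mean(axis=0) for c in (0, 1)]
    p = [np.mean(y == c) for c in (0, 1)]

    # Within-class scatter: class covariances weighted by class probability pc.
    SW = sum(p[c] * np.cov(X[y == c], rowvar=False) for c in (0, 1))

    # Fisher direction for two classes: w is proportional to SW^-1 (mu2 - mu1).
    w = np.linalg.solve(SW, mu[1] - mu[0])

    z = X @ w            # project every datapoint onto the new 1-D axis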
Principal Component Analysis

 Principal Component Analysis (PCA) is an unsupervised learning technique used
to examine the interrelations among a set of variables.

 It is also known as general factor analysis, where regression determines a line of
best fit.

 Principal Component Analysis (PCA) is a technique for dimensionality reduction
that identifies a set of orthogonal axes, called principal components, that capture
the maximum variance in the data.

 The principal components are linear combinations of the original variables in the
dataset and are ordered in decreasing order of importance.
Purpose of PCA
• PCA finds new coordinate axes (principal components) that better represent
the natural variance in the data
• Unlike LDA, it doesn't use class labels - it's an unsupervised method
• The goal is dimensionality reduction while preserving important information
• How PCA Works:
• It finds directions of maximum variance in the data
• The first principal component (PC1) captures the direction of greatest variance
• Each subsequent component is orthogonal to previous ones and captures the
next highest variance
• Components are ordered by importance (amount of variance explained)
Example

Left plot: Original data showing an elliptical pattern at 45° to the original axes.
Right plot: The same data after the PCA transformation, where the x-axis (PC1)
aligns with the direction of maximum variance and the y-axis (PC2) captures the
remaining variance.
Example of LDA and PCA

Plot of the iris data showing the three classes, after LDA has been applied.
The pixels of an image, converted into the relevant principal components.
PCA algorithm
•Original Data:
•Start with the raw data that shows correlation between variables
•In this case, we have 2D data with a clear directional trend

•Centering the Data:


•Subtract the mean from each dimension
•This moves the "cloud" of points to be centered at the origin
•Critical step as PCA looks for directions of maximum variance through origin

•Finding Principal Components:


•Calculate the covariance matrix of centered data
•Find eigenvectors and eigenvalues of this matrix
•Eigenvectors show directions of maximum variance
•Eigenvalues tell us how much variance is explained by each direction
•First PC points in direction of maximum variance
•Second PC is orthogonal to first and captures remaining variance
•The transformation rotates the data (P is the rotation matrix)
PCA Algorithm
 Write N datapoints xi = (x1i, x2i, . . . , xM i) as row vectors
 • Put these vectors into a matrix X (which will have size N × M)
 • Centre the data by subtracting off the mean of each column, putting it into
matrix B
 • Compute the covariance matrix
 • Compute the eigenvalues and eigenvectors of C, so ,
 where V holds the eigenvectors of C and D is the M × M diagonal eigenvalue
matrix
 • Sort the columns of D into order of decreasing eigenvalues, and apply the
same order to the columns of V
 • Reject those with eigenvalue less than some η, leaving L dimensions in the
data
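The steps above can be followed almost line by line in NumPy; a minimal sketch on
made-up data, with the threshold η chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))            # N x M data matrix
    B = X - X.mean(axis=0)                   # centre each column
    C = (B.T @ B) / X.shape[0]               # covariance matrix
    eigvals, V = np.linalg.eigh(C)           # eigenvalues/eigenvectors of C

    # Sort into decreasing order of eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]

    # Reject directions with eigenvalue below eta, keeping L of them.
    eta = 0.1
    L = int(np.sum(eigvals > eta))
    Y = B @ V[:, :L]                         # project data onto the kept components
    print(L, Y.shape)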
The Kernel PCA Algorithm

 Kernel PCA is a powerful nonlinear extension of regular PCA that can capture
more complex patterns in data.
 Regular PCA finds linear relationships in the data; Kernel PCA first maps the data
to a higher-dimensional feature space using a kernel function.
 It then performs PCA in this new space, where nonlinear patterns become linear.
• Key Steps: Compute kernel matrix K
• Center the kernel matrix
• Find eigenvectors and eigenvalues
• Project data onto principal components in feature space
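A minimal sketch of these steps using scikit-learn's KernelPCA on a toy dataset of
concentric circles (the RBF kernel and gamma value are illustrative choices):

    from sklearn.datasets import make_circles
    from sklearn.decomposition import KernelPCA

    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    # Linear PCA cannot separate concentric circles; an RBF kernel can.
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
    X_kpca = kpca.fit_transform(X)
    print(X_kpca.shape)  # (400, 2)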
Factor analysis
 This approach is powerful because it:
 Simplifies complex data into manageable components
 Reveals hidden patterns that aren't immediately obvious
 Helps researchers understand the fundamental structure of their data
 Accounts for measurement error in the analysis
 Discovers independent hidden factors
 Measures how much "noise" affects each factor
 Eg: 10 School Test Scores → 3 Basic Abilities
 - Math, Physics, Chemistry → "Scientific Ability"
 - English, History, Literature → "Language Ability"
 - Art, Music, Drama → "Creative Ability"
Factor Analysis

• Data Setup
• Start with a matrix X (N rows × M columns)
• Each row is one datapoint with M features
• Center the data (subtract mean from each column)

 Basic Model
 X = WY + ε
 Where: X = data
 W = weight/loading matrix
 Y = hidden factors
 ε = noise/error
• Key Assumptions
• Hidden factors (Y) are independent of each other
• Noise (ε) is:
• Normally distributed
• Has zero mean
• Has known variance (Ψ)
• Independent across measurements
 Covariance Structure
 Original Data Covariance = Factor Covariance + Noise Covariance
 Σ = WW^T + Ψ
• Main Goals: Find the loading matrix W
• Find the noise variances Ψ
• Use these to:
• Reconstruct original data
• Reduce dimensions
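A minimal sketch using scikit-learn's FactorAnalysis; the synthetic "test score" matrix
and the choice of 3 factors are illustrative assumptions mirroring the example above:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    scores = rng.normal(size=(300, 10))     # 300 students x 10 test scores

    fa = FactorAnalysis(n_components=3)     # look for 3 hidden abilities
    Y = fa.fit_transform(scores)            # factor scores for each student

    print(fa.components_.shape)             # (3, 10): loading matrix W
    print(fa.noise_variance_.shape)         # (10,): per-feature noise variance Ψ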
INDEPENDENT COMPONENTS ANALYSIS (ICA)

 A related approach to factor analysis.
 ICA finds completely independent components.
 Input Eg: Mixed sounds [Party Noise] = [Voice 1 + Voice 2 + Music + Glass clinking]
 Goal: Separate into individual sources: Voice 1, Voice 2, Music, Glass clinking
 Mathematical Model
 Given: two sound sources (s₁, s₂) and two microphones recording (x₁, x₂)
 x₁ = as₁ + bs₂
 x₂ = cs₁ + ds₂
 X = AS (Mixed Signals = Mixing Matrix × Source Signals)
 Where: X = what we observe (mixed signals)
 A = mixing matrix (how the signals combine)
 S = original source signals
 To recover the sources, compute:
 x = As (mixed signals = mixing matrix × source signals)
 s = Wx (source signals = unmixing matrix × mixed signals)
 Where W is our approximation of A⁻¹, and W must be square (same number of
inputs and outputs).
 Rank the sources by importance, keep only the top K sources, and discard the less
significant ones.
 Eg: If you have 5 mixed signals [Speech 1][Music][Speech 2][Noise][Echo], keep
only what you need (speech).
Eg: of a cocktail party

1. Microphone 1 (x₁) picks up 'a' amount of source 1 (as₁) and 'b' amount of
source 2 (bs₂).
2. Microphone 2 (x₂) picks up 'c' amount of source 1 (cs₁) and 'd' amount of
source 2 (ds₂).

 If s₁ = person speaking (speech) and s₂ = music playing (music), then:
 x₁ = (0.8 × speech) + (0.2 × music)
 x₂ = (0.3 × speech) + (0.7 × music)

 In matrix form:
 [x₁]   [a  b] [s₁]
 [x₂] = [c  d] [s₂]
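A minimal sketch of blind source separation with scikit-learn's FastICA, using two
synthetic sources and a made-up 2×2 mixing matrix A:

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)                      # "speech" source
    s2 = np.sign(np.sin(3 * t))             # "music" source
    S = np.c_[s1, s2]

    A = np.array([[0.8, 0.2],
                  [0.3, 0.7]])              # mixing matrix
    X = S @ A.T                             # what the microphones record

    ica = FastICA(n_components=2, random_state=0)
    S_est = ica.fit_transform(X)            # recovered sources (up to scale/order)
    W = ica.components_                     # estimated unmixing matrix
    print(S_est.shape, W.shape)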
LOCALLY LINEAR EMBEDDING

1. Main Purpose
• Takes high-dimensional data
• Creates a lower-dimensional version
• Keeps local relationships intact
• Core objective: minimizing reconstruction error
• Minimize: Σᵢ |xᵢ - Σⱼ wᵢⱼxⱼ|²
• Subject to: Σⱼ wᵢⱼ = 1
• Where: xᵢ = data point i, wᵢⱼ = weight between points i and j, xⱼ = neighbor of xᵢ

• LLE Algorithm
 Step 1: Find Neighbors
 For each point, find its K nearest neighbors using a distance metric:
 D(xᵢ, xⱼ) = ||xᵢ - xⱼ||²
 Create a local neighborhood, e.g. [Point1] → [Neighbor1, Neighbor2, Neighbor3]
 Step 2: Calculate Weights
 Express each point as a mix of its neighbors:
 Point = w1×Neighbor1 + w2×Neighbor2 + w3×Neighbor3
 The weights show connection strength.
 Minimize the reconstruction error:
 ε = Σᵢ |xᵢ - Σⱼ wᵢⱼxⱼ|²
 With constraints:
 Σⱼ wᵢⱼ = 1
 wᵢⱼ = 0 if j is not a neighbor of i
 Step 3: Find low-dimensional points yᵢ by minimizing:
 Φ(Y) = Σᵢ |yᵢ - Σⱼ wᵢⱼyⱼ|²
 Subject to the constraint: ΣYY^T = I (identity matrix)
 Step 4: Find the eigenvectors of the weight matrix and pick the top eigenvectors
for the reduction; the number kept depends on the desired output dimensions.

 High Dimension (Original Points): xᵢ ≈ w₁x₁ + w₂x₂ + w₃x₃
 Low Dimension (Reconstructed Points): yᵢ ≈ w₁y₁ + w₂y₂ + w₃y₃
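A minimal sketch using scikit-learn's LocallyLinearEmbedding on the swiss-roll
dataset (K = 10 neighbors is an illustrative choice):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
    Y = lle.fit_transform(X)                # 3-D points unrolled into 2-D
    print(Y.shape, lle.reconstruction_error_)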
Isomap-Isometric Feature Mapping

1. It is a nonlinear dimensionality reduction method that combines geodesic
distances with classical MDS (multi-dimensional scaling).
2. Core Concept:
• ISOMAP tries to preserve the geodesic distances between points on a
manifold (surface)
• It extends MDS by replacing Euclidean distances with geodesic distances
 Geodesic distance represents the true distance on a curved surface
3. Why ISOMAP?
• Classical MDS works well for linear manifolds (flat surfaces)
• Real-world data often lies on nonlinear manifolds (curved surfaces)
The Multi-Dimensional Scaling (MDS) Algorithm
Explanation of Algorithm

 Step 1: Compute D by taking pairwise similarities:
 Pairwise means you compute distances between x₁ and x₂, x₁ and x₃, x₁ and x₄,
...and so on for ALL possible pairs.
 This creates a distance matrix D with entries Dᵢⱼ = ||xᵢ - xⱼ||²
 Step 2: Compute B = -½ J D J, where J = I − (1/N) × (matrix of ones) and N is the
number of datapoints. When N = 3, J is a 3 × 3 matrix.
 Step 3: Find the largest eigenvalues of B along with their eigenvectors.
 Step 4: Put the eigenvalues into a diagonal matrix V and set the eigenvectors to be
the columns of matrix P.
 Step 5: Compute the embedding X = P V^(1/2)
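A minimal NumPy sketch of these MDS steps on made-up 3-D data, embedding into
L = 2 dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))

    # Step 1: matrix of squared pairwise distances.
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

    # Step 2: double-centre with J = I - (1/N) * ones.
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ D @ J

    # Steps 3-5: largest eigenvalues/eigenvectors of B, then X' = P V^(1/2).
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:2]       # keep the 2 largest
    P, V = eigvecs[:, order], np.diag(eigvals[order])
    X_embedded = P @ np.sqrt(V)
    print(X_embedded.shape)  # (50, 2)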


ISOMAP Algorithm
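A minimal sketch of the overall pipeline (geodesic distances followed by MDS) using
scikit-learn's Isomap; the swiss-roll data and neighbor count are illustrative choices:

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    iso = Isomap(n_neighbors=10, n_components=2)   # geodesic distances + MDS
    Y = iso.fit_transform(X)
    print(Y.shape)  # (1000, 2)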
Optimization and search

1. Optimization in machine learning is essentially the process of finding the best
parameters (weights and biases) for your model to make it perform as accurately as
possible. Think of it as "tuning" the model to minimize errors.

2. Objective:
• The goal is to minimize a function f(x) where x is a vector of parameters
• We start from an initial guess x(0) and try to find better solutions iteratively
• The solution space can be high-dimensional, making direct search impractical

3. Gradient Descent:
• Uses the gradient ∇f(x) to determine the direction of steepest descent
• The gradient is a vector containing partial derivatives for each dimension
• Each step updates the parameters as x ← x − η∇f(x), where η is the step size

4. Termination Conditions:
• We've reached a solution when ∇f = 0 (local minimum)
• In practice, we use |∇f| < ε where ε is a small number (e.g., 10⁻⁵)
• This accounts for numerical precision limitations in computers
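A minimal gradient-descent sketch on an assumed quadratic f(x) = ||x||², using a fixed
step size η and the |∇f| < ε stopping rule described above:

    import numpy as np

    def f(x):
        return np.sum(x ** 2)

    def grad_f(x):
        return 2 * x

    x = np.array([3.0, -4.0])       # initial guess x(0)
    eta, eps = 0.1, 1e-5            # step size and termination threshold

    while np.linalg.norm(grad_f(x)) >= eps:
        x = x - eta * grad_f(x)     # step in the direction of steepest descent

    print(x, f(x))                  # converges close to the minimum at the origin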
Evolutionary Learning

 Evolutionary learning is a branch of machine learning inspired by biological
evolution, where solutions to problems are evolved rather than programmed.
This approach mimics natural selection, inheritance, and the mutation of genes.
Introduction

 A population-based optimization method inspired by biological evolution
 Learns by evolving solutions rather than using gradients or fixed rules
 Uses fitness-based selection to improve solutions over generations
2. Key Components
• Population: Set of candidate solutions/models
• Fitness Function: Evaluates solution quality
• Selection: Chooses better solutions for reproduction
• Crossover: Combines good solutions
• Mutation: Introduces random variations
Genetic Algorithm

 A genetic algorithm is an adaptive heuristic search algorithm inspired by "Darwin's
theory of evolution in nature."

 It is used to solve optimization problems in machine learning.

 It is one of the important algorithms as it helps solve complex problems that would take
a long time to solve.
Basic Terminologies in Genetic Algorithm

 Population: Population is the subset of all possible or probable solutions which can
solve the given problem.

 Chromosomes: A chromosome is one of the solutions in the population for the given
problem, and a collection of genes generates a chromosome.

 Gene: A chromosome is divided into different genes; a gene is an element of the
chromosome.

 Allele: Allele is the value provided to the gene within a particular chromosome.

 Fitness Function: The fitness function is used to determine an individual's fitness level
in the population. It means the ability of an individual to compete with other individuals.
In every iteration, individuals are evaluated based on their fitness function.

 Genetic Operators: In a genetic algorithm, the best individuals mate to generate
offspring better than the parents. Here genetic operators play a role in changing the
genetic composition of the next generation.

 Selection: A selection process is used to determine which of the individuals in the
population will get to reproduce and produce the offspring that will form the next
generation.
Initialization

 The process of a genetic algorithm starts by generating the set of individuals,
which is called the population.

 Here each individual is a solution to the given problem. An individual contains, or
is characterized by, a set of parameters called genes.

 Genes are combined into a string to generate a chromosome, which is a solution
to the problem.

 One of the most popular techniques for initialization is the use of random binary
strings.
Fitness Assignment

 The fitness function is used to determine how fit an individual is.

 It means the ability of an individual to compete with other individuals. In every
iteration, individuals are evaluated based on their fitness function.

 The fitness function provides a fitness score to each individual. This score further
determines the probability of being selected for reproduction.

 The higher the fitness score, the more chances of getting selected for
reproduction.
Operators of Genetic Algorithm

 1. Selection Operator: The idea is to give preference to the individuals with good
fitness scores and allow them to pass their genes to successive generations.

 2. Crossover Operator: This represents mating between individuals. Two
individuals are selected using the selection operator and crossover sites are chosen
randomly. Then the genes at these crossover sites are exchanged, thus creating a
completely new individual (offspring).

 3. Mutation Operator: The key idea is to insert random genes in the offspring to
maintain the diversity in the population and avoid premature convergence.
Selection

 The selection phase involves the selection of individuals for the reproduction of
offspring.

 All the selected individuals are then arranged in pairs of two for reproduction.
These individuals then transfer their genes to the next generation.

 There are three types of Selection methods available, which are:

• Roulette wheel selection

• Tournament selection

• Rank-based selection
Reproduction
 After the selection process, the creation of a child occurs in the reproduction step.
 In this step, the genetic algorithm uses two variation operators that are applied to the parent
population.
 The two operators involved in the reproduction phase are given below:
• Crossover operator: The crossover operator plays the most significant role in the
reproduction phase of the genetic algorithm.
• In this process, a crossover point is selected at random within the genes. Then the
crossover operator swaps the genetic information of two parents from the current
generation to produce a new individual representing the offspring.
Mutation Operator
 The mutation operator inserts random genes in the offspring (new child) to maintain the
diversity in the population.
 It can be done by flipping some bits in the chromosomes.
 Mutation helps in solving the issue of premature convergence and enhances
diversification.
 Types of mutation styles available:
• Flip bit mutation
• Gaussian mutation
• Exchange/Swap mutation
Termination

 After the reproduction phase, a stopping criterion is applied as the basis for
termination.

 The algorithm terminates after the threshold fitness solution is reached. It will
identify the final solution as the best solution in the population.
Genetic Algorithm steps

 1) Randomly initialize population p
 2) Determine fitness of population
 3) Until convergence repeat:
 a) Select parents from population
 b) Crossover and generate new population
 c) Perform mutation on new population
 d) Calculate fitness for new population
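A minimal sketch of these steps for the classic OneMax problem (maximize the number
of 1s in a bit string); the population size, rates, and string length are illustrative choices:

    import random

    LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.01

    def fitness(chromosome):
        return sum(chromosome)                   # count of 1-bits

    def select(population):
        # Tournament selection: pick the fitter of two random individuals.
        a, b = random.sample(population, 2)
        return a if fitness(a) > fitness(b) else b

    def crossover(p1, p2):
        point = random.randint(1, LENGTH - 1)    # single random crossover site
        return p1[:point] + p2[point:]

    def mutate(chromosome):
        return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

    # 1) Randomly initialize the population with random binary strings.
    population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

    # 3) Repeat selection, crossover, and mutation for a fixed number of generations.
    for _ in range(GENERATIONS):
        population = [mutate(crossover(select(population), select(population)))
                      for _ in range(POP_SIZE)]

    print(max(fitness(c) for c in population))   # best fitness found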
Use and Applications of GA
 In Machine Learning:
• While GA itself is not an ML model, it can optimize various components of ML
models:
• Hyperparameter tuning: Selecting optimal parameters for models like neural networks,
SVMs, etc.
• Feature selection: Identifying the most relevant features for building ML models.
• Architecture optimization: Designing the structure of neural networks.

 Applications of Genetic Algorithms:
 Recurrent Neural Networks
 Mutation testing
 Code breaking
 Filtering and signal processing
 Learning fuzzy rule base
