ML Assignment 2 Report
1 Non-Comp Part
This report analyzes the shared code implementation for audio signal processing and the
corresponding waveform visualizations.
The code begins by mounting Google Drive for data access and resolving a compatibility
issue between NumPy and Librosa libraries. This is a common workaround when working
with audio processing libraries in Colab environments.
2.1 Feature Extraction Function
iois = np.diff(onset_times)

features['tempo'] = librosa.beat.tempo(onset_envelope=onset_env, sr=sr)[0]
features['ioi_mean'] = np.mean(iois) if len(iois) > 0 else 0
features['ioi_var'] = np.var(iois) if len(iois) > 0 else 0

return features
Using 20 MFCCs balances broad tone representation with moderate fine-grain detail,
making it ideal for non-speech audio with diverse timbral variations like snoring or
rain.
3. Zero Crossing Rate (ZCR): Measures how often the signal changes sign, useful
for detecting noisiness and tonal quality.
4. Spectral Features: Computes the mean and variance of spectral centroid (brightness) and spectral bandwidth (frequency spread).
5. Root Mean Square (RMS) Energy: Quantifies signal power (loudness) over
time.
6. Tempo and Inter-Onset Interval (IOI): Detects onset times, computes inter-
onset intervals, and estimates tempo to capture rhythmic properties.
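For reference, the following is a minimal sketch of how the remaining descriptors can be computed with librosa; it is illustrative rather than the exact notebook function (dictionary keys are assumptions), and the tempo/IOI code shown earlier would sit alongside it.

import numpy as np
import librosa

def extract_basic_features(y, sr):
    """Illustrative subset of the descriptors listed above."""
    feats = {}
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    feats['mfcc_mean'] = mfcc.mean(axis=1)
    feats['zcr_mean'] = librosa.feature.zero_crossing_rate(y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    feats['centroid_mean'], feats['centroid_var'] = centroid.mean(), centroid.var()
    feats['bandwidth_mean'], feats['bandwidth_var'] = bandwidth.mean(), bandwidth.var()
    feats['rms_mean'] = librosa.feature.rms(y=y).mean()
    return feats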
2.3 Batch Processing
The function is applied to all audio files listed in df_labels. For each file:
sample_paths = df_labels.iloc[:900].reset_index(drop=True)['filename'].iloc[:5].apply(
    lambda fn: os.path.join(TRAIN_FOLDER, fn))

for i, fp in enumerate(sample_paths):
    y, sr = librosa.load(fp)
    plt.figure(figsize=(12, 3))
    librosa.display.waveshow(y, sr=sr)
    plt.title(f'Waveform {i+1}')
    plt.tight_layout()
    plt.show()
This section loads five sample files from the training set and displays their waveforms
to understand the temporal characteristics of the audio data.
Figure 1: Percussive audio waveform with distinct transients and quick decay.
Figure 2: Sustained audio waveform with maintained energy throughout the duration.
pattern is highly characteristic of percussive sounds like drumbeats, claps, or impact sounds
where energy quickly dissipates after initial excitation.
Waveform 2: This waveform exhibits significantly different characteristics, with sustained energy throughout the 4.8-second duration. It features strong initial transients similar to Waveform 1, but notably contains extended periods of activity between t=0.3s and t=0.8s and again around t=1.3s-1.5s. Unlike Waveform 1, the signal maintains noticeable amplitude throughout the recording, though gradually decreasing over time. This
pattern suggests continuous audio content such as speech, singing, or sustained musical
passages with complex harmonic content.
Waveform 3: This waveform presents a more regular, rhythmic pattern with sharp
transient peaks occurring at predictable intervals across the entire duration. Significant
peaks appear at approximately t=0.6s, t=0.8s, t=1.5s, t=1.8s, t=2.4s, t=3.3s, t=3.6s,
t=4.2s, and t=4.6s. The consistency in both timing and amplitude suggests structured,
rhythmic content - possibly metronome clicks, rhythmic percussion, or structured speech
with regular emphasis patterns. Unlike the previous waveforms, the amplitude remains
relatively consistent throughout the recording without significant decay.
3.1 Code Explanation
The code uses librosa to process audio files and generate Mel Spectrograms. librosa.load(fp)
loads the audio, returning the waveform and sampling rate. librosa.feature.melspectrogram()
computes the spectrogram, and librosa.power_to_db() converts it to a dB scale. Finally,
librosa.display.specshow() visualizes it with time and frequency axes.
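The following is a minimal sketch of the pipeline just described (parameter values such as n_mels are illustrative, not necessarily those used in the notebook):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load(fp)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

plt.figure(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.show()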
Figure 5: Mel Spectrogram 2
4 Zero Crossing Rate Analysis
4.1 Concept of Zero Crossing Rate (ZCR)
The Zero Crossing Rate (ZCR) is a feature that indicates the number of times an audio
signal crosses the zero amplitude line (changes sign) per unit of time or frame. Mathemat-
ically, it is given by:
ZCR = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}\{ x_t \cdot x_{t-1} < 0 \}

where x_t is the signal amplitude at time t, and \mathbb{1} is an indicator function that is 1 when the sign of the signal changes between consecutive samples.
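As a minimal sketch (assuming y is a loaded waveform), the frame-wise and whole-signal versions of this quantity can be computed as follows:

import numpy as np
import librosa

# Frame-wise ZCR (librosa defaults: frame_length=2048, hop_length=512)
zcr_frames = librosa.feature.zero_crossing_rate(y)[0]

# Whole-signal estimate following the formula above
zcr_global = np.mean(np.sign(y[1:]) * np.sign(y[:-1]) < 0)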
• Low ZCR: Indicates tonal, voiced, or smooth content like vowels, musical notes, or
silence.
Analysis:
The consistently high and fluctuating ZCR suggests a noisy or irregular signal, likely from
complex environmental audio.
Figure 7: Zero Crossing Rate Plot 1
plt.figure(figsize=(10, 4))
librosa.display.specshow(tempogram, sr=sr, x_axis='time', y_axis='tempo')
plt.title(f'Tempogram {i+1}')
plt.colorbar()
plt.tight_layout()
plt.show()
• Iterates over audio files, loading waveform (y) and sampling rate (sr)
• Generates and displays the tempogram with proper labels and formatting
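The first lines of this loop are not visible above; the following is a minimal sketch of a typical setup (assuming the same sample_paths as earlier):

for i, fp in enumerate(sample_paths):
    y, sr = librosa.load(fp)
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)
    # ...followed by the specshow/plotting calls shown above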
Figure 9: Tempogram 1
Figure 11: Tempogram 3
all_features = X_train.columns.tolist()
batch_size = 4
6.2 Feature Distribution Plotting
The code processes features from X_train in batches of four, plotting each feature's distribution using histograms with KDE curves. It generates subplots for each batch, producing one figure per group to visually explore feature distributions.
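A minimal sketch of such a batched plotting loop (subplot layout details are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

for start in range(0, len(all_features), batch_size):
    batch = all_features[start:start + batch_size]
    fig, axes = plt.subplots(1, len(batch), figsize=(5 * len(batch), 4))
    axes = [axes] if len(batch) == 1 else axes
    for ax, feat in zip(axes, batch):
        sns.histplot(X_train[feat], kde=True, ax=ax)  # histogram with KDE overlay
        ax.set_title(feat)
    plt.tight_layout()
    plt.show()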
Figure 13: Second batch of features
distribution, variances are right-skewed with outliers, and deltas show peaked distributions, indicating stable temporal patterns. These insights guide feature selection, scaling, and outlier handling in the modeling pipeline.
The visualization shows a bar chart representing the distribution of audio categories in
the training dataset. Each bar corresponds to a specific sound category, with its height
indicating the number of samples for that category.
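A minimal sketch of how such a chart can be produced (assuming df_labels has a 'category' column, as used later when labels are read):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 4))
sns.countplot(x='category', data=df_labels,
              order=df_labels['category'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Number of Training Samples per Category')
plt.tight_layout()
plt.show()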
plt.figure(figsize=(10, 5))
sns.barplot(x=rf_importances.index, y=rf_importances.values)
plt.title('Top 20 Feature Importances (Random Forest)')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
characteristics. Importance scores decline gradually, indicating multiple features contribute
meaningfully.
plt.figure(figsize=(10, 5))
sns.barplot(x=mi_series.index, y=mi_series.values)
plt.title('Top 20 Feature Importances (Mutual Information)')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
9.4 Analysis of the Mutual Information Chart
transformed according to the normalization fitted on the training set. We then concatenated the three splits to produce the final CSV file. We also provide code to break it back down, which is what we ultimately use for PCA, KMeans, DBSCAN, and so on. Since the purpose of providing features in a CSV file was to make the testing process easier, we have done so. In the .ipynb file, you will find the point where we "begin" the clustering by extracting the dataframe from the CSV file and splitting it back into training, validation, and test sets.
This is not the same as normalizing all 1500 files together, as that would inevitably lead to data leakage and is not good practice. Normalization subtracts the mean from each data point and divides by the standard deviation; if these statistics are computed over all 1500 files, the mean and standard deviation carry information about the test and validation sets, because they are affected by those sets as well. So we split the set, normalize the training set, apply that same normalization to the other two splits, concatenate them, and write the FeaturesNC.csv file.
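A minimal sketch of that pipeline (the split names X_val and X_test are illustrative; X_train is assumed to be a DataFrame of features):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then apply it to the held-out splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Concatenate the normalized splits and write the final CSV
features_nc = pd.DataFrame(np.vstack([X_train_scaled, X_val_scaled, X_test_scaled]),
                           columns=X_train.columns)
features_nc.to_csv('FeaturesNC.csv', index=False)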
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
This ensures that each feature has mean 0 and standard deviation 1, preventing features
with larger magnitudes from dominating the distance calculations.
Each point is colored based on its true category. The result is shown in Figure 19.
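A minimal sketch of how this projection can be produced (assuming scaled features X_train_scaled and integer-encoded labels y_encoded):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit_transform(X_train_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y_encoded, cmap='tab20', s=10)
plt.title('t-SNE Projection Colored by True Category')
plt.tight_layout()
plt.show()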
Figure 19: t-SNE Projection Colored by True Category
Observations:
The 2D projection shows partial clustering, with some category overlap suggesting limited
class separability. However, distinct groupings indicate that certain categories possess
unique spectral-temporal characteristics.
The resulting clusters were then projected using t-SNE and plotted in Figure 20.
Observations:
• KMeans forms visually coherent clusters in 2D space.
• Several clusters appear compact and distinct, while some overlap, indicating mixed-
category grouping.
• The structure in clustering reveals some inherent patterns learned from the features,
despite being unsupervised.
11.4 Significance
t-SNE and KMeans together enable effective exploration of high-dimensional audio data,
offering insights into feature quality, class separability, and the potential for clustering-
based classification.
• b(i) is the mean nearest-cluster distance (to the closest different cluster).
12.3 Silhouette Coefficient Distribution
Figure 21: Silhouette Coefficient Distribution for KMeans Clustering (50 Clusters)
Interpretation
Most silhouette scores lie between 0 and 0.1, indicating weak cohesion. A negative tail
suggests some misclustered points, and few samples exceed 0.3—implying limited cluster
separation with k = 50.
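A minimal sketch of how the per-sample silhouette values behind Figure 21 can be computed (assuming the scaled feature matrix X_train_scaled):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt

cluster_labels = KMeans(n_clusters=50, random_state=42).fit_predict(X_train_scaled)
sil_values = silhouette_samples(X_train_scaled, cluster_labels)

plt.figure(figsize=(8, 4))
plt.hist(sil_values, bins=50)
plt.xlabel('Silhouette coefficient')
plt.ylabel('Number of samples')
plt.title('Silhouette Coefficient Distribution (k = 50)')
plt.show()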
12.4 Significance
Silhouette analysis provides a quantitative check on clustering quality, revealing that while
some structure exists, improvements in feature representation or cluster tuning are needed.
Similarly, we also performed a cluster-mapping of the features, but it was too complicated to reveal any specific observations.
• The algorithm uses Lloyd’s method, alternating between assignment and centroid
update steps.
Figure 22: Around 90 percent of the variance is explained once roughly 60 components are retained
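A minimal sketch of how the curve in Figure 22 can be generated (assuming X_train_scaled):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca_full = PCA().fit(X_train_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(cum_var)
plt.axhline(0.9, linestyle='--')          # 90% variance threshold
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()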
13.3 KMeans Clustering Visualization
Figure 23: KMeans Cluster Assignment Visualized via First Two PCA Components
Observations:
Dense clusters near the origin indicate strong cohesion, while overlapping regions and
scattered outliers suggest the presence of low-density classes or anomalies in the feature
space.
KMeans | n_clusters=40 → ARI=0.0668
KMeans | n_clusters=45 → ARI=0.0877
KMeans | n_clusters=50 → ARI=0.0784
KMeans | n_clusters=55 → ARI=0.0912
KMeans | n_clusters=60 → ARI=0.0837
Conclusion:
• Best value of k was found to be 55, with a validation ARI of 0.0912.
• This tuning is crucial in unsupervised setups to balance cluster granularity and accuracy.
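A minimal sketch of the tuning loop behind these numbers (the variable names X_val_features and y_val_true are illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

best_k, best_ari = None, -1.0
for k in [40, 45, 50, 55, 60]:
    cluster_labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_val_features)
    ari = adjusted_rand_score(y_val_true, cluster_labels)
    print(f"KMeans | n_clusters={k} -> ARI={ari:.4f}")
    if ari > best_ari:
        best_k, best_ari = k, ari
print("Best k:", best_k, "ARI:", best_ari)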
Core Definitions
• Core Point: A point x is a core point if there are at least min_samples points within distance ε of it.
• Border Point: Lies within ε of a core point but doesn’t satisfy core conditions itself.
• Noise Point: Neither a core nor reachable from a core — treated as outlier.
14.3 Custom Implementation
def dbscan(X, eps=0.5, min_samples=5):
    ...  # precompute eps-neighborhoods for all points
    for i in range(n):
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # Noise (may later be absorbed as a border point)
        else:
            ...  # Expand cluster from core point i over density-reachable neighbors
    return labels
Interpretation:
• Majority of points were successfully clustered.
• Only a small fraction was considered noise — good signal density.
• High ARI was achieved when ε was tighter and the density threshold was relaxed (min_samples=3), capturing fine-grained patterns.
14.6 Conclusion
DBSCAN provides a density-aware alternative to centroid-based clustering. Though it
yielded a lower ARI than KMeans, it robustly identifies outliers and is capable of modeling
non-convex clusters, making it valuable in high-noise or heterogeneous datasets.
Test ARI (DBSCAN with eps=5, min_samples=3): 0.1053
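A minimal sketch of how this test-set ARI can be reproduced with scikit-learn (the feature and label variable names are illustrative):

from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

# eps=5, min_samples=3 as reported above
db_labels = DBSCAN(eps=5, min_samples=3).fit_predict(X_test_features)
print("Test ARI (DBSCAN):", round(adjusted_rand_score(y_test, db_labels), 4))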
15 Comp part
Note: We were told quite late by Atul sir about the required format for submitting the features of our competitive models; by that time, we had already named our paths and files differently, and they were fairly scattered because we used two different models and fused them. Therefore, our files are in the FeaturesC folder of our zip file. The files are clearly referenced in the .ipynb files as well. Apologies for the trouble caused; it should take barely two minutes to upload those very light .npy files.
15.1 Introduction
This document provides a comprehensive analysis of a Python script designed for an audio
processing and classification pipeline. The code performs the following tasks:
• Installing and loading two deep learning based pre-trained models: YAMNet (for
audio embeddings) and AST (Audio Spectrogram Transformer) for additional feature
extraction.
• Extracting embeddings from audio files in both training and test datasets.
• Reducing the dimensionality of the fused features using tuned PCA (searching ideal
n over a range).
15.2 Step-by-Step Code Analysis
15.2.1 Mount Drive & Setup
Explanation:
This is straightforward: we mount the drive where the datasets are kept, set up their paths, and import the essential libraries.
Step 1: Audio Preprocessing: The input is a raw audio waveform, x(t). The first step
involves computing a log-scaled mel spectrogram, which represents the short-term power
spectrum of the audio:
X[m, t] = \log\!\left( \sum_{n} x^{2}(t+n)\, \phi_m(n) + \epsilon \right), \quad (1)

where ϕ_m(n) denotes the m-th mel filter, and ϵ is a small constant added for numerical stability.
Step 2: Convolutional Feature Extraction: The computed log mel spectrogram is then fed into a series of convolutional layers. These layers often use depthwise separable convolutions to reduce the number of parameters while retaining performance. Each convolutional layer can be mathematically described as:

y^{(l)} = f\!\left( \mathrm{BN}\!\left( W^{(l)} * y^{(l-1)} + b^{(l)} \right) \right), \quad (2)

where:
• y^{(l-1)} is the input to the l-th layer (with y^{(0)} = X[m, t]),
• W^{(l)} and b^{(l)} are the weights and biases for the l-th layer,
• * denotes the convolution operation,
• BN(·) denotes batch normalization,
• f(·) is a non-linear activation function, typically the ReLU.
Step 3: Temporal Pooling and Embedding Generation: After several convolutional stages, the feature maps are aggregated across time by global average pooling to produce a fixed-dimensional representation, h ∈ R^{1024}:

h = \frac{1}{T} \sum_{t=1}^{T} y_t^{(L)}, \quad (3)

where y_t^{(L)} represents the output of the last convolutional layer at time t, and T is the number of time frames.
Step 4: Classification: Finally, the embedding vector h is passed through a fully-connected (dense) layer to produce logits for each of the 521 classes:

z = W_{\mathrm{logits}} h + b_{\mathrm{logits}}, \quad (4)

and the class probabilities are computed via the softmax function:

p_i = \frac{\exp(z_i)}{\sum_{j=1}^{521} \exp(z_j)}, \quad \text{for } i = 1, 2, \ldots, 521. \quad (5)
15.3 Mathematical Summary
To summarize, the mathematical operations performed by YAMNet are as follows:
• Input Transformation: Convert raw audio x(t) into a log mel spectrogram:

  X[m, t] = \log\!\left( \sum_{n} x^{2}(t+n)\, \phi_m(n) + \epsilon \right).

• Classification: Map the pooled embedding h to logits and class probabilities:

  z = W_{\mathrm{logits}} h + b_{\mathrm{logits}}, \qquad p_i = \frac{\exp(z_i)}{\sum_{j=1}^{521} \exp(z_j)}.
Explanation:
def extract_yamnet_embedding(wav_path):
    try:
        # Load the WAV file and resample to 16 kHz (YAMNet's expected input rate)
        audio, sr = librosa.load(wav_path, sr=16000)
        if audio is None or len(audio) == 0:
            return None
        # Run YAMNet; the second output is the frame-level embedding matrix
        _, embeddings, _ = yamnet_model(audio)
        # Mean-pool over time to obtain a fixed-length vector
        return np.mean(embeddings.numpy(), axis=0)  # (1024,)
    except Exception as e:
        print(f"Error processing {wav_path}: {e}")
        return None
Listing 3: Extracting YAMNet Embeddings
Explanation:
a. Audio Loading: Uses librosa.load to load a WAV file and resample it to 16 kHz.
c. Model Inference: Runs the audio through YAMNet and extracts the embeddings.
d. Pooling: Computes the mean across the time dimension to obtain a fixed-length
vector (1024-dimensional).
e. Error Handling: Catches and prints any exceptions, returning None if an error
occurs.
Explanation:
• Reads the CSV file containing the training labels.
• Stacks the embeddings into a NumPy array and saves both the embeddings and labels
as .npy files.
Explanation:
• Lists all WAV files in the test folder.
• Extracts embeddings for each test file using the previously defined function.
• Collects and prints any files that were skipped due to errors.
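A minimal sketch of this test-set loop (TEST_FOLDER is an illustrative name; extract_yamnet_embedding refers to the extraction function shown earlier):

import os
import numpy as np

test_files = sorted(f for f in os.listdir(TEST_FOLDER) if f.endswith('.wav'))
test_embs, skipped = [], []
for fn in test_files:
    emb = extract_yamnet_embedding(os.path.join(TEST_FOLDER, fn))
    if emb is None:
        skipped.append(fn)
    else:
        test_embs.append(emb)

X_test = np.vstack(test_embs)
np.save('X_test.npy', X_test)
print('Skipped files:', skipped)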
15.3.5 Download .npy Files Locally (Optional)
Explanation:
Step 1: Loading the audio file and resampling it to a standard sampling rate (16 kHz).
Step 3: Feeding the spectrogram into the AST feature extractor, which prepares the input
for the transformer.
Step 4: Passing the processed spectrogram through the AST model to obtain transformer-
based embeddings.
Step 5: Extracting the embedding corresponding to the [CLS] token, which serves as a
compact representation of the audio.
2. Spectrogram Feature Extraction:
The resampled waveform is then passed to the ASTFeatureExtractor. Internally,
this extractor computes a log-scaled mel spectrogram from the waveform. This spectrogram is analogous to an image, where the time and frequency dimensions represent
the two axes.
where ϕ_f(n) represents the mel filter for frequency bin f, and ϵ is a small constant for numerical stability.
2. Patch Embedding
The spectrogram S is divided into N patches. Each patch S_i is flattened into a vector s_i ∈ R^P (with P being the patch size). A linear projection is then applied to each patch:

z_i = W_p s_i + b_p,

where W_p ∈ R^{D×P} is the projection matrix, b_p ∈ R^D is the bias, and D is the embedding dimension.
3. Positional Embedding and Input to Transformer
A learnable positional embedding E_{pos} ∈ R^{(N+1)×D} is added to the sequence of patch embeddings. A special [CLS] token with embedding z_{cls} ∈ R^D is prepended to the sequence:

Z_0 = [z_{cls}; z_1; z_2; \ldots; z_N] + E_{pos},
Explanation:
• Loads the AST feature extractor and model that has been fine-tuned on AudioSet.
• Determines whether a CUDA-enabled GPU is available and moves the model to the
appropriate device.
15.3.10 Define AST Embedding Extraction Function
Explanation:
c. Feature Extraction: Uses the AST feature extractor to prepare inputs for the
model.
d. Model Inference: Runs the AST model to obtain the output embeddings, taking
the embedding corresponding to the [CLS] token.
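A minimal sketch of such a function using the Hugging Face transformers API (the checkpoint name and function name are assumptions, not necessarily those used in the notebook):

import torch
import librosa
from transformers import ASTFeatureExtractor, ASTModel

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"  # assumed AudioSet-fine-tuned checkpoint
feature_extractor = ASTFeatureExtractor.from_pretrained(ckpt)
device = "cuda" if torch.cuda.is_available() else "cpu"
ast_model = ASTModel.from_pretrained(ckpt).to(device).eval()

def extract_ast_embedding(wav_path):
    waveform, sr = librosa.load(wav_path, sr=16000)            # load + resample
    inputs = feature_extractor(waveform, sampling_rate=sr,     # log-mel + patching
                               return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = ast_model(**inputs)                           # transformer forward pass
    return outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy()  # [CLS] embedding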
    X.append(emb)
    y.append(row['category'])

X = np.vstack(X)
y = np.array(y)
print("Train shape:", X.shape)

np.save('X_train.npy', X)
np.save('y_train.npy', y)
Listing 9: Extracting AST Embeddings for Training Data
Explanation:
• Stacks the embeddings and saves them along with the labels.
Explanation:
• Processes the test set similarly to the training data, extracting AST embeddings.
• Applies scaling and PCA transformation (after having fitted these on the training
set).
16 Deep Audio Representation Fusion
16.1 YAMNet Architecture Specifications
• Input: 0.96 s frames (15600 samples @ 16 kHz)
• Layer decomposition:
16.2 Fusion
Late fusion of embeddings via concatenation:
L = \alpha L_{CE} + (1 - \alpha) L_{triplet}
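A minimal sketch of the concatenation step itself (the YAMNet/AST array names and the 1024/768 embedding widths are assumptions; X_train_fused matches the name used later in Listing 13):

import numpy as np

# Late fusion: concatenate per-file YAMNet (1024-dim) and AST (768-dim) embeddings
X_train_fused = np.hstack([X_train_yamnet, X_train_ast])
X_test_fused = np.hstack([X_test_yamnet, X_test_ast])
print(X_train_fused.shape)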
17 Comparative Analysis
• Fixed Feature Extraction Process: These features are computed using fixed
mathematical formulas. They lack the ability to adapt to different audio tasks or
datasets, which may lead to suboptimal performance in diverse scenarios.
• Manual Tuning and Domain Expertise: The design and selection of these features require significant domain expertise and manual tuning. In contrast, deep learning models automatically learn the best representations from the data.
• Robustness to Variations:
Deep learning models trained on large-scale datasets inherently learn to generalize
over a wide variety of acoustic environments. This results in representations that are
robust to background noise, variations in recording conditions, and other distortions
that can degrade handcrafted features.
from data. This means that the models can capture subtle nuances in the audio that
might be missed by traditional methods.
• Redundancy of Traditional Features:
Since the deep models provide a rich and comprehensive representation of the audio
signal, they subsume the information captured by traditional features. Consequently,
incorporating MFCC, spectral, or chroma features becomes redundant. The fusion
of YAMNet and AST embeddings effectively encapsulates both the local and global
characteristics of the audio, rendering additional handcrafted features unnecessary.
18.3 Scale the Fused Embeddings
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_fused)
X_test_scaled = scaler.transform(X_test_fused)
Listing 13: Standardization
Explanation:
• Applies standard scaling to the fused training and test features to normalize the
feature distributions.
20.2 Tuning Methodology
The optimization process follows:
ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} \quad (11)
• Evaluation metric (ARI) measures cluster similarity between predicted and true labels
best_ari = 0
best_pca_model = None
best_X_pca = None
best_n_components = 0

for n in [60, 80, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220]:
    pca = PCA(n_components=n, random_state=42)
    X_pca_candidate = pca.fit_transform(X_train_scaled)

    # Quick baseline check on each dimension with simple LR
    X_train_p, X_val_p, y_train_p, y_val_p = train_test_split(
        X_pca_candidate, y_encoded,
        test_size=0.2, stratify=y_encoded, random_state=42
    )

    # Quick logistic baseline
    clf = LogisticRegression(max_iter=2000, solver='saga')
    clf.fit(X_train_p, y_train_p)
    ari_candidate = adjusted_rand_score(y_val_p, clf.predict(X_val_p))

    print(f"PCA {n}D ARI = {ari_candidate:.4f}")

    if ari_candidate > best_ari:
        best_ari = ari_candidate
        best_X_pca = X_pca_candidate
        best_pca_model = pca
        best_n_components = n

print(f"\nBest PCA Dim = {best_n_components}, baseline ARI = {best_ari:.4f}")
Listing 14: PCA Tuning
Explanation: This code block performs hyperparameter tuning for Principal Component Analysis (PCA) by trying out different numbers of components and evaluating the
performance of a simple Logistic Regression classifier. The objective is to select the PCA
dimension that results in the best baseline performance as measured by the Adjusted Rand
Index (ARI). Below is a detailed breakdown of the process:
Step 1: Initialization:
• best_ari is initialized to 0. This variable will store the highest ARI score observed.
• best_pca_model is set to None and will later hold the PCA model with the optimal number of components.
• best_X_pca will store the transformed training data corresponding to the best PCA model.
• best_n_components is initialized to 0 and will record the optimal number of PCA components.
Step 2: Loop Over Candidate PCA Dimensions:
• The code iterates over a list of candidate dimensions: [60, 80, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220].
• For each candidate value n, a PCA model is instantiated with n_components=n and a fixed random_state for reproducibility.
• The PCA model is then fitted to the standardized training data (X_train_scaled) and used to transform it, producing X_pca_candidate.
• The transformed data is split into a training and validation set using train_test_split. The split is stratified based on the encoded labels (y_encoded) to preserve the class distribution.
• A Logistic Regression classifier is instantiated with a maximum of 2000 iterations and the saga solver, which is suitable for large datasets.
• The classifier is trained on the training split (X_train_p and y_train_p) and its performance is evaluated on the validation set using the Adjusted Rand Index (ARI) as the metric.
• After the loop completes, the code prints the optimal PCA dimensionality and the
corresponding baseline ARI.
X_pca_trainval = best_X_pca
X_train, X_val, y_train_, y_val_ = train_test_split(
    X_pca_trainval, y_encoded,
    test_size=0.2, stratify=y_encoded, random_state=42
)
Listing 15: Splitting Data
Explanation:
• Splits the PCA-transformed training data into training and validation sets for subsequent model tuning.
where k = 3 folds in the code. The search space is the Cartesian product:

\Theta_{\mathrm{grid}} = \prod_{i=1}^{m} \{\theta_i^{(1)}, \theta_i^{(2)}, \ldots\} \quad (14)
22 Random Forest
22.1 Algorithm
• Ensemble of B decision trees: \{T_b(x)\}_{b=1}^{B}
23 Logistic Regression
23.1 Mathematical Formulation
Multinomial logistic regression with:
P(y = c \mid x) = \frac{e^{w_c^{T} x}}{\sum_{k=1}^{K} e^{w_k^{T} x}} \quad (17)
24 XGBoost
24.1 Model Definition
Gradient boosted trees with additive functions:
\hat{y}_i = \sum_{t=1}^{T} f_t(x_i), \quad f_t \in \mathcal{F} \quad (19)
25 Code Implementation Strategy
The tuning process combines three mathematical approaches:
2. Validation Metric:
\mathrm{Accuracy} = \frac{1}{n_{\mathrm{val}}} \sum_{i=1}^{n_{\mathrm{val}}} I(\hat{y}_i = y_i) \quad (22)
3. Ensemble Foundation:
\hat{y}_{\mathrm{final}} = \mathrm{Vote}\left( f_{\theta^*_{\mathrm{RF}}},\, f_{\theta^*_{\mathrm{LR}}},\, f_{\theta^*_{\mathrm{XGB}}} \right) \quad (23)
rf_grid = {
    'n_estimators': [100, 200],
    'max_depth': [20, None],
    'min_samples_split': [2, 5]
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=42),
                         rf_grid,
                         cv=3, scoring='accuracy', n_jobs=-1)
rf_search.fit(X_train, y_train_)
clf_rf = rf_search.best_estimator_
print("RF Best Params:", rf_search.best_params_)
lr_grid = {
    'C': [0.1, 1, 10]
}
lr_search = GridSearchCV(LogisticRegression(),
                         lr_grid,
                         cv=3, scoring='accuracy', n_jobs=-1)
lr_search.fit(X_train, y_train_)
clf_lr = lr_search.best_estimator_
print("LR Best Params:", lr_search.best_params_)
29 XGBoost Tuning
xgb_grid = {
    'n_estimators': [100, 200],
    'max_depth': [6, 10],
    'learning_rate': [0.05, 0.1]
}
xgb_base = xgb.XGBClassifier(
    objective='multi:softprob',
    num_class=len(np.unique(y_encoded)),
    eval_metric='mlogloss',
    use_label_encoder=False,
    random_state=42
)

xgb_search = GridSearchCV(xgb_base,
                          xgb_grid,
                          cv=3, scoring='accuracy', n_jobs=-1)
xgb_search.fit(X_train, y_train_)
clf_xgb = xgb_search.best_estimator_
print("XGB Best Params:", xgb_search.best_params_)
Table 3: Hyperparameter Search Spaces

Model                 Parameter        Values
Random Forest         n_estimators     {100, 200}
                      max_depth        {20, None}
Logistic Regression   C                {0.1, 1, 10}
XGBoost               learning_rate    {0.05, 0.1}
30.3.4 Regularization Balance
• XGBoost:

  \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2} \quad (36)

• Random Forest:

  \mathrm{OOB\ Error} = \frac{1}{n} \sum_{i=1}^{n} I\left(\hat{y}_i^{\mathrm{OOB}} \neq y_i\right) \quad (37)
30.3.5 Conclusion
The combination works because:
Where:
31.1.2 Code Implementation
31.4 Performance Validation
Adjusted Rand Index (ARI) calculation:

ARI = \frac{\mathrm{Agreement} - \mathrm{Expected\ Agreement}}{\mathrm{Max\ Agreement} - \mathrm{Expected\ Agreement}}
    = \frac{\binom{n}{2}(a + d) - \left[(a+b)(a+c) + (c+d)(b+d)\right]}{\binom{n}{2}^{2} - \left[(a+b)(a+c) + (c+d)(b+d)\right]}
This section describes how an ensemble classifier is built using the soft voting strategy.
The ensemble combines three base classifiers — Random Forest (RF), Logistic Regression
(LR), and XGBoost (XGB) — whose best hyperparameters were obtained through prior
grid search tuning. The ensemble is evaluated on a validation set using the Adjusted Rand
Index (ARI) metric.
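A minimal sketch of how such a soft-voting ensemble can be assembled with scikit-learn (reusing the tuned base estimators from the listings above; the name `ensemble` is illustrative):

from sklearn.ensemble import VotingClassifier
from sklearn.metrics import adjusted_rand_score

ensemble = VotingClassifier(
    estimators=[('rf', clf_rf), ('lr', clf_lr), ('xgb', clf_xgb)],
    voting='soft'   # average the predicted class probabilities
)
ensemble.fit(X_train, y_train_)

y_val_pred = ensemble.predict(X_val)
print("Validation ARI:", adjusted_rand_score(y_val_, y_val_pred))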
Explanation
• Predicting on the Validation Set:
– The ensemble classifier makes predictions on the validation set (X_val) by combining the probability outputs of each base model.
– The adjusted_rand_score function is used to evaluate the clustering performance by comparing the true labels (y_val_) with the predicted labels (y_val_pred).
– ARI is a metric that measures the similarity between two data clusterings, adjusted for chance. A higher ARI indicates better agreement between the predicted clusters and the true labels.
• Result Output:
Explanation:
The test set is transformed using the trained PCA model, and predictions are made
using the ensemble classifier. These are converted back to categorical labels and saved as
a submission-ready CSV file.
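A minimal sketch of this final step (label_encoder, test_files, and the output filename are assumptions; best_pca_model and X_test_scaled come from the earlier listings):

import pandas as pd

# Project the scaled test features with the fitted PCA, predict, and decode labels
X_test_pca = best_pca_model.transform(X_test_scaled)
y_test_pred = ensemble.predict(X_test_pca)
y_test_labels = label_encoder.inverse_transform(y_test_pred)

submission = pd.DataFrame({'filename': test_files, 'category': y_test_labels})
submission.to_csv('submission.csv', index=False)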
31.7 Discussion of Results
So, it is fair to say that our strategy worked well and outperformed the other possible strategies we experimented with; of course, others have outperformed us, and there is still scope for improvement!
Cheers!