Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
9 views59 pages

ML Assignment 2 Report

This report details the implementation of an audio processing pipeline for clustering and classification, focusing on feature extraction from audio files. Key features extracted include MFCCs, Zero Crossing Rate, and spectral characteristics, which are essential for machine learning tasks. The report also includes waveform and Mel spectrogram analyses, providing insights into the audio's temporal and frequency characteristics.

Uploaded by

Rohit Jadekar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views59 pages

ML Assignment 2 Report

This report details the implementation of an audio processing pipeline for clustering and classification, focusing on feature extraction from audio files. Key features extracted include MFCCs, Zero Crossing Rate, and spectral characteristics, which are essential for machine learning tasks. The report also includes waveform and Mel spectrogram analyses, providing insights into the audio's temporal and frequency characteristics.

Uploaded by

Rohit Jadekar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 59

Audio Clustering Track 1

Rohit Yuvaraj Jadekar Akarsh Gupta


Entry No.: 2022MT61984 Entry No.: 2022MT61966

June 19, 2025

1 Non-Comp Part
This report analyzes the shared code implementation for audio signal processing and the
corresponding waveform visualizations.

1.1 Code Analysis and Framework


The provided code implements a complete audio processing pipeline for clustering and clas-
sification. The implementation follows a structured approach with several key components:

1.1.1 Setup and Library Configuration


from google.colab import drive
drive.mount(’/content/drive’)

# Fix numpy-librosa compatibility


import numpy as np
np.complex = complex

The code begins by mounting Google Drive for data access and resolving a compatibility
issue between NumPy and Librosa libraries. This is a common workaround when working
with audio processing libraries in Colab environments.

2 Feature Extraction from Audio Files


The code extracts meaningful features from audio files, condensing raw audio signals into
descriptive numerical representations for further analysis. Below is a summary of the
process:

1
2.1 Feature Extraction Function

1 # STEP 1: FEATURE EXTRACTION


2 import librosa
3 import numpy as np
4 from scipy . stats import skew , kurtosis
5
6 import librosa
7 import numpy as np
8
9 def extract_features ( file_path ) :
10 y , sr = librosa . load ( file_path , sr = None )
11 features = {}
12
13 # --- MFCCs ---
14 mfccs = librosa . feature . mfcc ( y =y , sr = sr , n_mfcc =20)
15 mfccs_delta = librosa . feature . delta ( mfccs )
16 mfccs_delta2 = librosa . feature . delta ( mfccs , order =2)
17
18 for i in range (20) :
19 features [f ’ mfcc_ { i +1} _mean ’] = np . mean ( mfccs [ i ])
20 features [f ’ mfcc_ { i +1} _var ’] = np . var ( mfccs [ i ])
21 features [f ’ delta_ { i +1} _mean ’] = np . mean ( mfccs_delta [ i ])
22 features [f ’ delta2_ { i +1} _mean ’] = np . mean ( mfccs_delta2 [ i ])
23
24 # --- Zero Crossing Rate ---
25 zcr = librosa . feature . zero_crossing_rate ( y ) [0]
26 features [ ’ zcr_mean ’] = np . mean ( zcr )
27 features [ ’ zcr_var ’] = np . var ( zcr )
28
29 # --- Spectral Centroid ---
30 sc = librosa . feature . spectral_centroid ( y =y , sr = sr ) [0]
31 features [ ’ spec_centroid_mean ’] = np . mean ( sc )
32 features [ ’ spec_centroid_var ’] = np . var ( sc )
33
34 # --- Spectral Bandwidth ---
35 bw = librosa . feature . spectral_bandwidth ( y =y , sr = sr ) [0]
36 features [ ’ spec_bw_mean ’] = np . mean ( bw )
37 features [ ’ spec_bw_var ’] = np . var ( bw )
38
39 # --- Root Mean Square Energy ---
40 rms = librosa . feature . rms ( y = y ) [0]
41 features [ ’ rms_mean ’] = np . mean ( rms )
42 features [ ’ rms_var ’] = np . var ( rms )
43
44 # --- Tempo and IOI ( Inter - Onset Interval ) ---
45 onset_env = librosa . onset . onset_strength ( y =y , sr = sr )
46 onset_frames = librosa . onset . onset_detect ( onset_envelope = onset_env , sr =
sr )
47 onset_times = librosa . frames_to_time ( onset_frames , sr = sr )

2
48 iois = np . diff ( onset_times )
49
50 features [ ’ tempo ’] = librosa . beat . tempo ( onset_envelope = onset_env , sr = sr )
[0]
51 features [ ’ ioi_mean ’] = np . mean ( iois ) if len ( iois ) > 0 else 0
52 features [ ’ ioi_var ’] = np . var ( iois ) if len ( iois ) > 0 else 0
53
54 return features

The extract features(file path) function performs the following operations:

1. MFCCs and Derivatives: Computes 20 MFCCs (Mel-Frequency Cepstral Coeffi-


cients) along with their first-order (delta) and second-order (delta-delta) derivatives.
These capture timbral properties and spectral changes over time.

2. Why Use 20 MFCCs?


MFCCs (Mel-Frequency Cepstral Coefficients) capture the short-term spectral shape
of an audio signal using the Mel-scaled power spectrum and Discrete Cosine Trans-
form (DCT).

2.2 Rationale for 20 MFCCs


• Lower MFCCs (1st–4th): Capture energy, loudness, and pitch.
• Middle MFCCs (5th–13th): Represent timbre, voice type, and tone.
• Higher MFCCs (14th onward): Add finer spectral details but are prone to
noise.

Using 20 MFCCs balances broad tone representation with moderate fine-grain detail,
making it ideal for non-speech audio with diverse timbral variations like snoring or
rain.

3. Zero Crossing Rate (ZCR): Measures how often the signal changes sign, useful
for detecting noisiness and tonal quality.

4. Spectral Features: Computes the mean and variance of spectral centroid (bright-
ness) and spectral bandwidth (frequency spread).

5. Root Mean Square (RMS) Energy: Quantifies signal power (loudness) over
time.

6. Tempo and Inter-Onset Interval (IOI): Detects onset times, computes inter-
onset intervals, and estimates tempo to capture rhythmic properties.

3
2.3 Batch Processing
The function is applied to all audio files listed in df labels. For each file:

• Features are extracted and stored in a dictionary.

• Metadata such as filename and category are appended.

• Results are saved to a CSV file for further analysis.

2.4 Conceptual Overview


This process condenses raw audio data into perceptually relevant features, enabling machine
learning tasks like classification or clustering. Key benefits include:

1. Dimensionality reduction from raw signals to meaningful statistics.

2. Statistical characterization of audio attributes (mean, variance).

3. Capturing rhythmic properties through tempo and IOI analysis.

2.5 Waveform Analysis


The code includes visualization of waveforms with:

sample_paths = df_labels.iloc[:900].reset_index(drop=True)[’filename’].iloc[:5].apply(
lambda fn: os.path.join(TRAIN_FOLDER, fn))
for i, fp in enumerate(sample_paths):
y, sr = librosa.load(fp)
plt.figure(figsize=(12, 3))
librosa.display.waveshow(y, sr=sr)
plt.title(f’Waveform {i+1}’)
plt.tight_layout()
plt.show()

This section loads five sample files from the training set and displays their waveforms
to understand the temporal characteristics of the audio data.

2.5.1 Analysis of Provided Waveform Images


Waveform 1: This waveform demonstrates clear transient characteristics with 5 distinct
peaks occurring at approximately t=0.0s, t=0.3s, t=0.6s, t=0.8s, and t=1.1s. Each peak
shows rapid amplitude changes reaching nearly ±1.0, followed by a natural decay pattern.
The signal energy diminishes significantly after t=1.3s, becoming virtually silent. This

4
Figure 1: Percussive audio waveform with distinct transients and quick decay.

Figure 2: Sustained audio waveform with maintained energy throughout the duration.

Figure 3: Rhythmic audio waveform with regularly spaced transient peaks.

5
pattern is highly characteristic of percussive sounds like drumbeats, claps, or impact sounds
where energy quickly dissipates after initial excitation.
Waveform 2: This waveform exhibits significantly different characteristics with sus-
tained energy throughout the 4.8-second duration. It features strong initial transients
similar to Waveform 1, but notably contains extended periods of activity between t=0.3s
and t=0.8s and again around t=1.3s-1.5s. Unlike Waveform 1, the signal maintains no-
ticeable amplitude throughout the recording, though gradually decreasing over time. This
pattern suggests continuous audio content such as speech, singing, or sustained musical
passages with complex harmonic content.
Waveform 3: This waveform presents a more regular, rhythmic pattern with sharp
transient peaks occurring at predictable intervals across the entire duration. Significant
peaks appear at approximately t=0.6s, t=0.8s, t=1.5s, t=1.8s, t=2.4s, t=3.3s, t=3.6s,
t=4.2s, and t=4.6s. The consistency in both timing and amplitude suggests structured,
rhythmic content - possibly metronome clicks, rhythmic percussion, or structured speech
with regular emphasis patterns. Unlike the previous waveforms, the amplitude remains
relatively consistent throughout the recording without significant decay.

2.6 Conceptual Framework


The audio analysis pipeline begins with waveform representation, capturing amplitude
variations over time. It then extracts engineered features like ZCR (time-domain) and
MFCCs, spectral centroid, bandwidth (frequency-domain) for machine learning suitability.
Finally, statistical aggregation (mean, variance) converts variable-length signals into fixed-
length feature vectors.

3 Mel Spectrogram Analysis


1 # === 2. Mel Spectrograms ===
2 for i , fp in enumerate ( sample_paths ) :
3 y , sr = librosa . load ( fp )
4 S = librosa . feature . melspectrogram ( y =y , sr = sr )
5 S_dB = librosa . power_to_db (S , ref = np . max )
6 plt . figure ( figsize =(10 , 4) )
7 librosa . display . specshow ( S_dB , x_axis = ’ time ’ , y_axis = ’ mel ’ , sr = sr )
8 plt . colorbar ( format = ’%+2.0 f dB ’)
9 plt . title (f ’ Mel Spectrogram { i +1} ’)
10 plt . tight_layout ()
11 plt . show ()
12 The following code generates and displays Mel Spectrograms for a given set
of audio files .

6
3.1 Code Explanation
The code uses librosa to process audio files and generate Mel Spectrograms. librosa.load(fp)
loads the audio, returning the waveform and sampling rate. librosa.feature.melspectrogram()
computes the spectrogram, and librosa.power to db() converts it to a dB scale. Finally,
librosa.display.specshow() visualizes it with time and frequency axes.

3.2 Mel Spectrogram Concept


A Mel Spectrogram is a visual representation of an audio signal’s frequency content over
time. The x-axis represents time, the y-axis represents the Mel frequencies, and the color
intensity represents the energy level at each time-frequency point.
The Mel scale is designed to reflect the way humans perceive sound, where lower fre-
quencies are linearly spaced, and higher frequencies are spaced logarithmically.

3.3 Analysis of the Spectrogram Images

Figure 4: Mel Spectrogram 1

3.3.1 Mel Spectrogram Interpretations


Image 1: Represents a simple, periodic sound with energy concentrated in vertical bands.
Image 2: Displays a complex signal with energy spread over time and frequency.
Image 3: Shows periodic patterns with varying intensity, indicating repeated sound events.

7
Figure 5: Mel Spectrogram 2

Figure 6: Mel Spectrogram 3

3.4 Significance of Mel Spectrograms


Mel Spectrograms are a useful representation of audio data, as they capture the frequency
distribution of an audio signal in a human-perceptible scale. They are commonly used
in machine learning tasks like audio classification, speech recognition, and sound event
detection.

8
4 Zero Crossing Rate Analysis
4.1 Concept of Zero Crossing Rate (ZCR)
The Zero Crossing Rate (ZCR) is a feature that indicates the number of times an audio
signal crosses the zero amplitude line (changes sign) per unit of time or frame. Mathemat-
ically, it is given by:
T −1
1 X
ZCR = ⊮{(xt ·xt−1 )<0}
T −1
t=1

where xt is the signal amplitude at time t, and ⊮ is an indicator function that is 1 when
the sign of the signal changes between consecutive samples.

• High ZCR: Indicates noisy or high-frequency content, such as snoring, static, or


consonants in speech.

• Low ZCR: Indicates tonal, voiced, or smooth content like vowels, musical notes, or
silence.

4.2 Python Code to Plot ZCR


for i, fp in enumerate(sample_paths):
y, sr = librosa.load(fp)
zcr = librosa.feature.zero_crossing_rate(y)[0]
plt.figure(figsize=(10, 3))
plt.plot(zcr)
plt.title(f’Zero Crossing Rate {i+1}’)
plt.tight_layout()
plt.show()

4.3 ZCR Plot Analysis


Analysis:
The ZCR plot begins with periodic activity, followed by a sharp transient spike and a long
near-zero region—indicating an impulse-like sound followed by silence or smoothness.

Analysis:
The consistently high and fluctuating ZCR suggests a noisy or irregular signal, likely from
complex environmental audio.

9
Figure 7: Zero Crossing Rate Plot 1

Figure 8: Zero Crossing Rate Plot 2

4.4 Significance of ZCR in Audio Analysis


ZCR is a simple yet powerful feature used to distinguish between voiced/unvoiced sounds,
noise, and silence—making it valuable for speech and audio classification tasks.

5 Tempograms in Audio Analysis


5.1 Code Explanation
The provided code utilizes the librosa library to generate tempograms for a collection of
audio files:
1 for i , fp in enumerate ( sample_paths ) :
2 y , sr = librosa . load ( fp )
3 oenv = librosa . onset . onset_strength ( y =y , sr = sr )
4 tempogram = librosa . feature . tempogram ( onset_envelope = oenv , sr = sr )

10
5 plt . figure ( figsize =(10 , 4) )
6 librosa . display . specshow ( tempogram , sr = sr , x_axis = ’ time ’ , y_axis = ’ tempo
’)
7 plt . title ( f ’ Tempogram { i +1} ’)
8 plt . colorbar ()
9 plt . tight_layout ()
10 plt . show ()

• Iterates over audio files, loading waveform (y) and sampling rate (sr)

• Computes onset strength envelope (oenv) to capture rhythmic activity

• Generates and displays the tempogram with proper labels and formatting

5.2 The Concept of Tempograms


A tempogram is a time-tempo representation that visualizes the periodic structure of an au-
dio signal across time. Conceptually, it’s similar to a spectrogram but focuses on rhythmic
information rather than frequency content.
Key aspects of tempograms include:

• X-axis: Represents time progression through the audio

• Y-axis: Represents tempo in beats per minute (BPM)

• Color intensity: Indicates the likelihood or strength of a particular tempo at each


time point

Tempograms are created by analyzing the autocorrelation of onset strength envelopes


or other methods that can detect periodicities in the signal. They provide valuable insights
into rhythmic patterns, tempo changes, and overall tempo stability in musical pieces.

5.3 Analysis of Provided Tempograms


5.3.1 Tempogram Interpretations
Tempogram 1: Shows a stable rhythm with a strong high-tempo component ( 256 BPM)
and low energy elsewhere.
Tempogram 2: Highlights strong high-tempo energy with a clear intensity gradient over
time, suggesting dynamic rhythmic activity.
Tempogram 3: Displays multiple rhythmic layers with consistent bands at different tem-
pos, indicating complex but steady rhythmic structure.

11
Figure 9: tempogram 1

Figure 10: tempogram 2

5.4 Significance of Tempogram Analysis


Tempograms reveal tempo and rhythmic patterns, aiding in tempo detection, rhythm clas-
sification, structural segmentation, genre identification, and performance analysis across
musical or audio content.

12
Figure 11: tempogram 3

6 Histograms of Feature Distributions


6.1 Code Analysis
The provided code creates histograms to visualize the distribution of features extracted
from audio data:

# === 5. Histograms of Feature Distributions ===


import matplotlib.pyplot as plt
import seaborn as sns

all_features = X_train.columns.tolist()
batch_size = 4

for i in range(0, len(all_features), batch_size):


batch_feats = all_features[i:i+batch_size]
plt.figure(figsize=(16, 4))
for j, feat in enumerate(batch_feats):
plt.subplot(1, batch_size, j + 1)
sns.histplot(X_train[feat], kde=True, bins=30)
plt.title(feat)
plt.tight_layout()
plt.show()

13
6.2 Feature Distribution Plotting
The code processes features from X train in batches of four, plotting each feature’s distri-
bution using histograms with KDE curves. It generates subplots for each batch, producing
one figure per group to visually explore feature distributions.

6.3 Understanding the MFCC Histograms


The images show distributions of four related audio features:

6.3.1 MFCC Feature Summary


MFCC 1 mean: Approximately normal distribution centered near -375, indicating con-
sistency across samples.
MFCC 1 var: Right-skewed with most values below 20,000, suggesting generally low vari-
ance with some outliers.
delta 1 mean: Symmetric around 0, showing little net directional change in MFCC dy-
namics.
delta2 1 mean: Narrow peak around 0.1, indicating consistent acceleration in MFCC
variation.

6.4 MFCC Feature Significance


MFCCs capture perceptually meaningful spectral information, mimicking human auditory
response. Delta and delta-delta coefficients provide temporal dynamics, making MFCCs
essential for tasks like speech and music classification.

6.5 Insights from the Distributions

Figure 12: first batch of features

14
Figure 13: second batch of features

Figure 14: third batch of features

Figure 15: fourth batch of features

6.6 Insights from Feature Distributions


The histograms highlight key preprocessing considerations: MFCC means and deltas dif-
fer in scale, underscoring the need for normalization. MFCC means follow a normal

15
distribution, variances are right-skewed with outliers, and deltas show peaked distribu-
tions—indicating stable temporal patterns. These insights guide feature selection, scaling,
and outlier handling in the modeling pipeline.

7 Category Distribution in Audio Classification


7.1 Code Analysis
The following code generates a visualization of the class distribution in the training dataset:
# === 6. Category Distribution ===
plt.figure(figsize=(10, 4))
sns.countplot(x=y_train)
plt.title(’Class Distribution (Training Set)’)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

7.2 Category Distribution Plot


The code uses seaborn.countplot() to visualize category frequencies in y train. It
creates a 10 × 4 inch figure, rotates x-axis labels for clarity, applies tight layout() to
optimize spacing, and displays the finalized plot.

7.3 Analysis of the Category Distribution Plot

Figure 16: Distribution of sound categories in the training dataset

16
The visualization shows a bar chart representing the distribution of audio categories in
the training dataset. Each bar corresponds to a specific sound category, with its height
indicating the number of samples for that category.

7.4 Category Distribution Insights


The dataset spans around 40 sound categories, with most having 15–20 samples. High-
frequency categories include footsteps, vacuum cleaner, and pig, while fewer samples
appear for brushing teeth and can opening. Overall, the distribution is relatively bal-
anced with no extreme class imbalances.

7.5 Significance of the Distribution


This visualization serves several crucial purposes in the machine learning pipeline:

7.6 Dataset Assessment and Design Insights


The dataset is relatively balanced across categories, aiding fair model training and mini-
mizing class bias. While most categories have adequate sample counts, a few have slightly
fewer. This balance likely results from intentional curation, quality filtering, and practical
data collection constraints—reflecting good design for audio classification tasks.

7.7 Audio Feature Analysis Summary


Boxplots visualize category-wise feature separability, while pairplots of top variant features
reveal inter-feature relationships and class clustering. Additionally, identifying the top 20
features by variance highlights the most informative audio characteristics for downstream
classification.

8 Top 20 Features by Random Forest


8.1 Code Analysis
The following code extracts and visualizes feature importance from a Random Forest model
trained on audio features:

# === 10. Top 20 Features by Random Forest ===


le = LabelEncoder()
y_enc = le.fit_transform(y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_enc)
rf_importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(asce

17
plt.figure(figsize=(10, 5))
sns.barplot(x=rf_importances.index, y=rf_importances.values)
plt.title(’Top 20 Feature Importances (Random Forest)’)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

8.2 Feature Importance via Random Forest


The code encodes audio category labels numerically, trains a Random Forest classifier
with 100 trees, and extracts feature importance scores. It then visualizes the top 20 most
informative audio features using a bar chart.

8.3 Interpretation of the Random Forest Feature Importance Chart

Figure 17: Top 20 Feature Importances according to Random Forest Classifier

8.4 Random Forest Feature Importance Insights


The top 20 features are dominated by MFCCs, especially mfcc 2 mean and mfcc 1 mean,
with importance scores ranging from 0.025 to 0.014. Spectral features (centroid, band-
width, flatness) and ZCR also contribute, showing that classification relies on diverse audio

18
characteristics. Importance scores decline gradually, indicating multiple features contribute
meaningfully.

8.5 Conceptual and Practical Significance


The results confirm MFCCs as key features in audio classification, with both mean and
variance-based descriptors offering value. Spectral and temporal features provide comple-
mentary information. These insights support informed feature selection and suggest that
model performance depends on capturing multiple aspects of the sound signal.

9 Top 20 Features by Mutual Information


9.1 Code Analysis
The following code calculates and visualizes the most informative features for audio clas-
sification using mutual information:

# === 11. Top 20 Features by Mutual Information ===


mi = mutual_info_classif(X_train, y_enc, random_state=42)
mi_series = pd.Series(mi, index=X_train.columns).sort_values(ascending=False).head(20)

plt.figure(figsize=(10, 5))
sns.barplot(x=mi_series.index, y=mi_series.values)
plt.title(’Top 20 Feature Importances (Mutual Information)’)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

9.2 Mutual Information Feature Ranking


The code computes mutual information scores using mutual info classif, ranks features
by their dependency with the target, and visualizes the top 20 in a bar chart.

9.3 Understanding Mutual Information


Mutual information measures the reduction in uncertainty about class labels given a fea-
ture. It captures both linear and non-linear dependencies, making it effective for identifying
features that best distinguish between audio categories.

19
9.4 Analysis of the Mutual Information Chart

Figure 18: Top 20 Feature Importances based on Mutual Information

9.5 Mutual Information Feature Insights


The top 20 features are led by MFCCs, particularly mfcc 2 mean and mfcc 1 mean, with
scores ranging from 0.6 to 0.4. Spectral features like spec bw var and spec centroid var
also contribute. The gradual score decline suggests that multiple features provide valuable
information.

9.6 Significance and Interpretation


Mutual information quantifies how much each feature reduces class uncertainty. The con-
sistent prominence of MFCCs reinforces their relevance, while inclusion of both mean and
variance features highlights the value of static and dynamic audio properties. Being model-
independent, mutual information offers robust, generalizable insights for feature selection.

10 Normalization and Formation of the CSV File


There were a few critical factors troubling us while making the csv file ; we cannot com-
pletely normalize the whole 1500 audio files together, for this will result in data leakage
we have to break them into train, validation, and test. The later two will be forcefully

20
transformed as per the criteria set by the training normalization. Then we concatenated
the three to give the final csv file. We also provide code for you to break it down, which
is what we ultimately use in PCA, KMeans, DBSCAN and so on. Since the purpose of
providing features in a csv file was to make the testing process easier, we have done so.
In the .ipynb file, you will find a point where we ”begin” the clustering by extracting the
dataframe from the csv file and then breaking it down into training, validation, and test
sets.
This is not same as normalizing all 1500 files together, as that will inevitably lead to
data leakage and is not a good practice. Since while normalizing you subtract the mean
from your data point and divide it by the standard deviation, if you take that for all the
1500 files, then you have some information about the test and validation sets because the
mean and standard deviation are affected by them as well. This is not good. So we split the
set, normalize the train set, force fit that normalization on the rest of the two, concatenate
them , and make the FeaturesNC.csv file !

11 t-SNE Visualization and KMeans Clustering


11.1 Normalization
Before applying dimensionality reduction or clustering, the data was normalized using
z-score standardization:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

This ensures that each feature has mean 0 and standard deviation 1, preventing features
with larger magnitudes from dominating the distance calculations.

11.2 t-SNE Projection by Category


To visualize the high-dimensional audio features, we used t-SNE with 2 components:

X_tsne_cat = TSNE(n_components=2, perplexity=30, random_state=42)

Each point is colored based on its true category. The result is shown in Figure 19.

21
Figure 19: t-SNE Projection Colored by True Category

22
Observations:
The 2D projection shows partial clustering, with some category overlap suggesting limited
class separability. However, distinct groupings indicate that certain categories possess
unique spectral-temporal characteristics.

11.3 t-SNE Projection with KMeans Clusters


We applied KMeans clustering on the normalized features with k = 50 clusters:

kmeans = KMeans(n_clusters=50, random_state=42)


labels_km = kmeans.fit_predict(X_train_scaled)

The resulting clusters were then projected using t-SNE and plotted in Figure 20.

Figure 20: t-SNE Projection Colored by KMeans Cluster Labels

Observations:
• KMeans forms visually coherent clusters in 2D space.

• Several clusters appear compact and distinct, while some overlap, indicating mixed-
category grouping.

• The structure in clustering reveals some inherent patterns learned from the features,
despite being unsupervised.

23
11.4 Significance
t-SNE and KMeans together enable effective exploration of high-dimensional audio data,
offering insights into feature quality, class separability, and the potential for clustering-
based classification.

12 Silhouette Analysis for KMeans Clustering


12.1 Silhouette Coefficient: Concept
The silhouette coefficient is a metric used to evaluate the quality of clustering. For a data
point i, it is defined as:
b(i) − a(i)
s(i) =
max{a(i), b(i)}
where:

• a(i) is the mean intra-cluster distance (within the same cluster),

• b(i) is the mean nearest-cluster distance (to the closest different cluster).

The silhouette score s(i) ranges from −1 to 1:

• s(i) ≈ 1: well-separated and well-clustered.

• s(i) ≈ 0: on or near the cluster boundary.

• s(i) < 0: likely misclassified.

12.2 Python Code


silhouette_vals = silhouette_samples(X_train_scaled, labels_km)
plt.figure(figsize=(10, 3))
sns.histplot(silhouette_vals, bins=30, kde=True)
plt.title(’Silhouette Coefficient Distribution (KMeans, 50 Clusters)’)
plt.tight_layout()
plt.show()

24
12.3 Silhouette Coefficient Distribution

Figure 21: Silhouette Coefficient Distribution for KMeans Clustering (50 Clusters)

Interpretation
Most silhouette scores lie between 0 and 0.1, indicating weak cohesion. A negative tail
suggests some misclustered points, and few samples exceed 0.3—implying limited cluster
separation with k = 50.

12.4 Significance
Silhouette analysis provides a quantitative check on clustering quality, revealing that while
some structure exists, improvements in feature representation or cluster tuning are needed.
Similarly, we have also performed a cluster-mapping of the features but that did not
reveal any specific observation since it was too complicated.

13 KMeans Clustering and PCA Visualization


13.1 Clustering with KMeans
KMeans is an unsupervised learning algorithm that partitions data into k clusters such that
the intra-cluster variance is minimized. Given data points x1 , . . . , xn and cluster centroids
µ1 , . . . , µk , the goal is to minimize:
n
X
∥xi − µci ∥2
i=1

where ci is the index of the closest cluster center to xi .

25
• The algorithm uses Lloyd’s method, alternating between assignment and centroid
update steps.

• Convergence is typically fast, especially when initialized with k-means++.

• The number of clusters k must be specified manually.

13.2 Dimensionality Reduction with PCA


To visualize the clustering structure, PCA was used to reduce the 50-dimensional data into
2D for plotting:

PCA(X) = XW where W contains the top 2 eigenvectors

Figure 22: This shows how around 90 percent of the variance is explained as soon as we
reach 60 components

The mathematics behind PCA has been explained in detail in

26
13.3 KMeans Clustering Visualization

Figure 23: KMeans Cluster Assignment Visualized via First Two PCA Components

Observations:
Dense clusters near the origin indicate strong cohesion, while overlapping regions and
scattered outliers suggest the presence of low-density classes or anomalies in the feature
space.

13.4 Adjusted Rand Index (ARI) Evaluation


val_ari = adjusted_rand_score(y_val, val_preds)
test_ari = adjusted_rand_score(y_test, test_preds)
• ARI on validation set: 0.0784
• ARI on test set: e.g., 0.1050
• ARI corrects for random chance and is robust to label permutations.

13.5 Hyperparameter Tuning


We performed a grid search over k ∈ {40, 45, 50, 55, 60} and selected the k maximizing
validation ARI.

27
KMeans | n_clusters=40 → ARI=0.0668
KMeans | n_clusters=45 → ARI=0.0877
KMeans | n_clusters=50 → ARI=0.0784
KMeans | n_clusters=55 → ARI=0.0912
KMeans | n_clusters=60 → ARI=0.0837

Conclusion:
• Best value of k was found to be 55, with a validation ARI of 0.0912.

• Performance degrades slightly beyond this, suggesting cluster splitting or overfitting.

• This tuning is crucial in unsupervised setups to balance cluster granularity and ac-
curacy.

14 Density-Based Clustering with DBSCAN


14.1 Algorithmic Overview
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a non-parametric
clustering algorithm that identifies dense regions in feature space and separates them from
sparse (noise) regions. Unlike KMeans, DBSCAN does not require the number of clusters
k.

Core Definitions
• Core Point: A point x is a core point if there are at least min samples points within
distance ε of it.

• Border Point: Lies within ε of a core point but doesn’t satisfy core conditions itself.

• Noise Point: Neither a core nor reachable from a core — treated as outlier.

14.2 DBSCAN Parameters


• eps (ε): Neighborhood radius for density estimation.

• min samples: Minimum number of points in a neighborhood to be considered a


cluster core.

28
14.3 Custom Implementation
def dbscan(X, eps=0.5, min_samples=5):
...
for i in range(n):
if len(neighbors[i]) < min_samples:
labels[i] = -1 # Noise
else:
# Expand cluster from core

14.4 Hyperparameter Tuning


We performed grid search across:
eps ∈ {5, 7.5, 10, 12.5}, min samples ∈ {3, 5, 7}
The best result on validation data was:
eps = 5, min samples = 3, ARI = 0.1972

14.5 Noise vs Clustered Points

Figure 24: DBSCAN: Count of Clustered vs Noise Points (Validation Set)

Interpretation:
• Majority of points were successfully clustered.
• Only a small fraction was considered noise — good signal density.
• High ARI was achieved when ε was tighter and density threshold was relaxed (min samples=3),
capturing fine-grained patterns.

29
14.6 Conclusion
DBSCAN provides a density-aware alternative to centroid-based clustering. Though it
yielded a lower ARI than KMeans, it robustly identifies outliers and is capable of modeling
non-convex clusters — making it valuable in high-noise or heterogeneous datasets .
Test ARI (DBSCAN with eps=5, min samples=3): 0.1053

15 Comp part
Note : We were told quite late that the format of how to submit the features of our
competitive models, by Atul sir; by that time, we had already named out paths and files
something else and they were pretty scattered because we used two different models and
fused them. Therefore our files are in the FeaturesC folder of our zipfile. The files have
been clearly mentioned in the .ipynb files as well. Apologies for the trouble caused, it will
hardly take 2 minutes to upload those very light .npy files.

15.1 Introduction
This document provides a comprehensive analysis of a Python script designed for an audio
processing and classification pipeline. The code performs the following tasks:

• Mounting Google Drive and setting up paths.

• Installing and loading two deep learning based pre-trained models: YAMNet (for
audio embeddings) and AST (Audio Spectrogram Transformer) for additional feature
extraction.

• Extracting embeddings from audio files in both training and test datasets.

• Saving extracted embeddings as .npy files and downloading them.

• Preprocessing the data by fusing embeddings from both models.

• Reducing the dimensionality of the fused features using tuned PCA (searching ideal
n over a range).

• Making an ensemble classifier (Random Forest, Logistic Regression, and XGBoost)


with hyperparameter tuning.

• Predicting on the test set and creating a submission file.

30
15.2 Step-by-Step Code Analysis
15.2.1 Mount Drive & Setup

1 from google . colab import drive


2 drive . mount ( ’/ content / drive ’)
3
4 import os
5 import numpy as np
6 import pandas as pd
7 import librosa
8 import tensorflow_hub as hub
9 from tqdm import tqdm
10 from google . colab import files
11
12
13
14 # Adjust these to your actual paths
15 train_folder = " / content / drive / MyDrive / audio - clustering -2402 - mtl -782/
Dataset / train_folder "
16 test_folder = " / content / drive / MyDrive / audio - clustering -2402 - mtl -782/
Dataset / test_folder "
17 train_label_csv = " / content / drive / MyDrive / audio - clustering -2402 - mtl -782/
Dataset / train_labels . csv "
18
19 # Where we ’ ll save . npy
20 X_yam_train_path = " / content / drive / MyDrive / X_yam_train . npy "
21 y_train_path = " / content / drive / MyDrive / y_train . npy "
22 X_yam_test_path = " / content / drive / MyDrive / X_yam_test . npy "
Listing 1: Mounting Drive and Importing Libraries

Explanation:
Simple stuff, mounting the drive where we kept the datasets, putting up their addresses,
and importing essential libraries

15.2.2 Overview of YAMNet


YAMNet is a deep convolutional neural network model designed for audio event classifica-
tion. It is pre-trained on the large-scale AudioSet dataset, which contains over 2 million
human-labeled 10-second sound clips covering 521 classes. YAMNet takes raw audio wave-
forms as input and outputs both frame-level predictions for sound events and a compact
embedding vector that captures rich audio features. This embedding can be used for
downstream tasks such as clustering, transfer learning, or further classification.

15.2.3 Processing Pipeline


The YAMNet model follows these major steps:

31
Step 1: Audio Preprocessing: The input is a raw audio waveform, x(t). The first step
involves computing a log-scaled mel spectrogram, which represents the short-term power
spectrum of the audio:
!
X
2
X[m, t] = log x (t + n) ϕm (n) + ϵ , (1)
n

where ϕm (n) denotes the m-th mel filter, and ϵ is a small constant added for numerical
stability.
Step 2: Convolutional Feature Extraction: The computed log mel spectrogram is
then fed into a series of convolutional layers. These layers often use depthwise separa-
ble convolutions to reduce the number of parameters while retaining performance. Each
convolutional layer can be mathematically described as:
 
y(l) = f BN W (l) ∗ y(l−1) + b(l) , (2)

where:
• y(l−1) is the input to the l-th layer (with y(0) = X[m, t]),
• W (l) and b(l) are the weights and biases for the l-th layer,
• ∗ denotes the convolution operation,
• BN(·) denotes batch normalization,
• f (·) is a non-linear activation function, typically the ReLU.
Step 3: Temporal Pooling and Embedding Generation: After several convolutional
stages, the feature maps are aggregated across time by global average pooling to produce
a fixed-dimensional representation, h ∈ R1024 :
T
1 X (L)
h= yt , (3)
T
t=1

(L)
where yt represents the output of the last convolutional layer at time t, and T is the
number of time frames.
Step 4: Classification: Finally, the embedding vector h is passed through a fully-connected
(dense) layer to produce logits for each of the 521 classes:

z = Wlogits h + blogits , (4)

and the class probabilities are computed via the softmax function:
exp(zi )
pi = P521 , for i = 1, 2, . . . , 521. (5)
j=1 exp(zj )

32
15.3 Mathematical Summary
To summarize, the mathematical operations performed by YAMNet are as follows:

• Input Transformation: Convert raw audio x(t) into a log mel spectrogram:
!
X
X[m, t] = log x2 (t + n) ϕm (n) + ϵ .
n

• Convolutional Layers: Process the spectrogram through a series of convolutional


layers:  
y(l) = f BN W (l) ∗ y(l−1) + b(l) .

• Global Pooling: Aggregate the time-dimension:


T
1 X (L)
h= yt .
T
t=1

• Final Classification: Compute logits and probabilities:

exp(zi )
z = Wlogits h + blogits , pi = P521 .
j=1 exp(zj )

15.3.1 Install & Load YAMNet Model

1 ! pip install tensorflow tensorflow_hub librosa -- quiet


2
3 yamnet_model = hub . load ( " https :// tfhub . dev / google / yamnet /1 " )
Listing 2: Installing Dependencies and Loading YAMNet

Explanation:

• Installs required packages (tensorflow, tensorflow hub, and librosa) quietly.

• Loads the YAMNet model from TensorFlow Hub.

15.3.2 Define YAMNet Embedding Extraction Function

1 def e x tr a c t _y a m n et _ e m be d d i ng ( wav_path , sr =16000) :


2 """ Load . wav , run YAMNet , return mean - pooled embedding of shape (1024 ,)
. """
3 try :
4 audio , _ = librosa . load ( wav_path , sr = sr )
5 if len ( audio ) == 0:

33
6 return None
7 _ , embeddings , _ = yamnet_model ( audio )
8 return np . mean ( embeddings . numpy () , axis =0) # (1024 ,)
9 except Exception as e :
10 print ( f " Error processing { wav_path }: { e } " )
11 return None
Listing 3: Extracting YAMNet Embeddings
Explanation:
a. Audio Loading: Uses librosa.load to load a WAV file and resample it to 16 kHz.

b. Empty Check: Returns None if the audio file is empty.

c. Model Inference: Runs the audio through YAMNet and extracts the embeddings.

d. Pooling: Computes the mean across the time dimension to obtain a fixed-length
vector (1024-dimensional).

e. Error Handling: Catches and prints any exceptions, returning None if an error
occurs.

15.3.3 Build YAMNet Embeddings for TRAIN Folder

1 labels_df = pd . read_csv ( train_label_csv ) # columns : [ filename , category ,


...]
2 X_train_list = []
3 y_train_list = []
4
5 for _ , row in tqdm ( labels_df . iterrows () , total = len ( labels_df ) ) :
6 wav_file = os . path . join ( train_folder , row [ " filename " ])
7 emb = e x tr a c t _y a m n et _ e m be d d i ng ( wav_file )
8 if emb is not None :
9 X_train_list . append ( emb )
10 y_train_list . append ( row [ " category " ])
11 else :
12 print ( f " Skipped embedding for { row [ ’ filename ’]} " )
13
14 X_yam_train = np . vstack ( X_train_list )
15 y_train = np . array ( y_train_list )
16 print ( " X_yam_train shape : " , X_yam_train . shape )
17 print ( " y_train shape : " , y_train . shape )
18
19 # Save . npy
20 np . save ( X_yam_train_path , X_yam_train )
21 np . save ( y_train_path , y_train )
22 print ( f " Saved : { X_yam_train_path } and { y_train_path } " )
Listing 4: Extracting and Saving Train Embeddings

34
Explanation:
• Reads the CSV file containing the training labels.

• Iterates through each row to:

a. Construct the file path for each audio file.


b. Extract the YAMNet embedding.
c. Append the embedding and corresponding category to lists.

• Stacks the embeddings into a NumPy array and saves both the embeddings and labels
as .npy files.

15.3.4 Build YAMNet Embeddings for TEST Folder

1 test_files = sorted ([ f for f in os . listdir ( test_folder ) if f . endswith ( ’. wav


’) ])
2 X_test_list = []
3 skipped_files = []
4
5 for fname in tqdm ( test_files ) :
6 wav_path = os . path . join ( test_folder , fname )
7 emb = e x tr a c t _y a m n et _ e m be d d i ng ( wav_path )
8 if emb is not None :
9 X_test_list . append ( emb )
10 else :
11 skipped_files . append ( fname )
12
13 X_yam_test = np . vstack ( X_test_list )
14 print ( " X_yam_test shape : " , X_yam_test . shape )
15
16 # Save . npy
17 np . save ( X_yam_test_path , X_yam_test )
18 print ( f " Saved : { X_yam_test_path } " )
19
20 if skipped_files :
21 print ( " Skipped these test files : " , skipped_files )
Listing 5: Extracting and Saving Test Embeddings

Explanation:
• Lists all WAV files in the test folder.

• Extracts embeddings for each test file using the previously defined function.

• Collects and prints any files that were skipped due to errors.

• Saves the resulting test embeddings as a .npy file.

35
15.3.5 Download .npy Files Locally (Optional)

1 files . download ( X_yam_train_path )


2 files . download ( y_train_path )
3 files . download ( X_yam_test_path )
Listing 6: Downloading Files Locally

Explanation:

• Uses files.download from the google.colab package to download the generated


.npy files to your local machine.

15.3.6 Overview of the AST Model


The Audio Spectrogram Transformer (AST) is a deep learning model designed for audio
classification tasks. AST leverages the transformer architecture, originally developed for
natural language processing, and adapts it to work with audio spectrograms. In the context
of the provided code, AST is used to extract a fixed-length embedding from a given audio
file, which can then be used for tasks such as classification or clustering.
The main steps in the AST pipeline, as implemented in the code, are:

Step 1: Loading the audio file and resampling it to a standard sampling rate (16 kHz).

Step 2: Converting the waveform into a log-scaled mel spectrogram.

Step 3: Feeding the spectrogram into the AST feature extractor, which prepares the input
for the transformer.

Step 4: Passing the processed spectrogram through the AST model to obtain transformer-
based embeddings.

Step 5: Extracting the embedding corresponding to the [CLS] token, which serves as a
compact representation of the audio.

15.3.7 Detailed Explanation in the Context of the Code


In the code snippet for AST embedding extraction, the following steps occur:

1. Audio Loading and Resampling:


The audio file is loaded using torchaudio.load, which returns the waveform and its
original sampling rate. The waveform is then resampled at 16 kHz using Torchau-
dio.transforms.Resample. If the audio has multiple channels, the channels are
averaged to produce a single-channel (mono) signal.

36
2. Spectrogram Feature Extraction:
The resampled waveform is then passed to the ASTFeatureExtractor. Internally,
this extractor computes a log-scaled mel spectrogram from the waveform. This spec-
trogram is analogous to an image, where the time and frequency dimensions represent
the two axes.

3. Transformer-Based Embedding Extraction:


The preprocessed spectrogram is fed into the AST model. Similar to the Vision
Transformer (ViT), the spectrogram is divided into patches. Each patch is flattened
and projected linearly into a latent embedding space. Positional embeddings are
added to these patch embeddings to retain the spatial (time-frequency) relationships.
The resulting sequence is then passed through several transformer encoder layers.

4. Classification Token ([CLS]) Embedding:


In the transformer architecture, a special token ([CLS]) is prepended to the sequence
of patch embeddings. After the transformer processing, the output corresponding
to this [CLS] token is taken as the global representation (embedding) of the entire
spectrogram.

15.3.8 Mathematical Description of the AST Model


Below is a mathematical formulation of the core components of the AST model:
*1. Input and Spectrogram Computation
Let x(t) be the raw audio waveform. The AST feature extractor converts x(t) into a
log-scaled mel spectrogram S ∈ RF ×T , where F is the number of mel frequency bins and
T is the number of time frames. This operation is given by:
!
X
2
S[f, t] = log x (t + n) ϕf (n) + ϵ ,
n

where ϕf (n) represents the mel filter for frequency bin f , and ϵ is a small constant for
numerical stability.
*2. Patch Embedding
The spectrogram S is divided into N patches. Each patch Si is flattened into a vector
si ∈ RP (with P being the patch size). A linear projection is then applied to each patch:

z i = W p s i + bp ,

where Wp ∈ RD×P is the projection matrix, bp ∈ RD is the bias, and D is the embedding
dimension.
*3. Positional Embedding and Input to Transformer

37
A learnable positional embedding Epos ∈ R(N +1)×D is added to the sequence of patch
embeddings. A special [CLS] token with embedding zcls ∈ RD is prepended to the sequence:

Z0 = [zcls ; z1 ; z2 ; . . . ; zN ] + Epos ,

where Z0 ∈ R(N +1)×D is the input to the transformer encoder.


*4. Transformer Encoder
The transformer encoder consists of multiple layers. For the l-th layer, the operations
are as follows:
Z′l = LayerNorm(Zl−1 ),
Al = MultiHead(Z′l ),
Z′′l = Zl−1 + Al ,
Zl = Z′′l + MLP(LayerNorm(Z′′l )).
After L such layers, the final output is ZL .
*5. Extraction of the [CLS] Token
(L)
The embedding corresponding to the [CLS] token, zcls , is extracted from the final
output:
(L)
h = zcls .
This vector h ∈ RD is used as the fixed-length representation of the input audio signal.

15.3.9 Load AST Model and Feature Extractor

1 fea ture_extractor = ASTFeatureExtractor . from_pretrained ( " MIT / ast - finetuned -


audioset -10 -10 -0.4593 " )
2 model = ASTModel . from_pretrained ( " MIT / ast - finetuned - audioset -10 -10 -0.4593 " )
3 model . eval ()
4 device = torch . device ( " cuda " if torch . cuda . is_available () else " cpu " )
5 model . to ( device )
Listing 7: Loading AST Model

Explanation:

• Loads the AST feature extractor and model that has been fine-tuned on AudioSet.

• Sets the model to evaluation mode.

• Determines whether a CUDA-enabled GPU is available and moves the model to the
appropriate device.

38
15.3.10 Define AST Embedding Extraction Function

1 def e x tra ct_as t_em bedd ing ( wav_path ) :


2 try :
3 waveform , sr = torchaudio . load ( wav_path )
4 if waveform . shape [1] == 0:
5 return None
6
7 resample = torchaudio . transforms . Resample ( orig_freq = sr , new_freq
=16000)
8 waveform = resample ( waveform ) . mean ( dim =0) . unsqueeze (0)
9
10 inputs = feature_extractor ( waveform . squeeze () . numpy () ,
sampling_rate =16000 , return_tensors = " pt " )
11 inputs = { k : v . to ( device ) for k , v in inputs . items () }
12 with torch . no_grad () :
13 outputs = model (** inputs )
14 # Take the [ CLS ] token embedding
15 return outputs . last_hidden_state [: , 0 , :]. cpu () . numpy () . squeeze ()
16 except Exception as e :
17 print ( f " { wav_path } failed : { e } " )
18 return None
Listing 8: Defining AST Embedding Extraction Function

Explanation:

a. Audio Loading: Loads the waveform using torchaudio.

b. Resampling and Averaging: Resamples the waveform to 16 kHz and averages


across channels if necessary.

c. Feature Extraction: Uses the AST feature extractor to prepare inputs for the
model.

d. Model Inference: Runs the AST model to obtain the output embeddings, taking
the embedding corresponding to the [CLS] token.

e. Error Handling: Returns None if an exception occurs.

15.3.11 Load Train Set & Extract AST Embeddings

1 labels_df = pd . read_csv ( label_csv )


2 X , y = [] , []
3
4 for _ , row in tqdm ( labels_df . iterrows () , total = len ( labels_df ) ) :
5 emb = e xtra ct_a st_em bedd ing ( os . path . join ( train_folder , row [ ’ filename ’ ])
)
6 if emb is not None :

39
7 X . append ( emb )
8 y . append ( row [ ’ category ’ ])
9
10 X = np . vstack ( X )
11 y = np . array ( y )
12 print ( " Train shape : " , X . shape )
13
14 np . save ( ’ X_train . npy ’ , X )
15 np . save ( ’ y_train . npy ’ , y )
Listing 9: Extracting AST Embeddings for Training Data

Explanation:

• Reads the CSV file containing labels.

• Iterates over each training file to extract AST embeddings.

• Stacks the embeddings and saves them along with the labels.

15.3.12 Process Test Set & Predict (AST Pipeline)

1 test_files = sorted ([ f for f in os . listdir ( test_folder ) if f . endswith ( ’. wav


’) ])
2 X_test , test_ids = [] , []
3
4 for f in tqdm ( test_files ) :
5 emb = e xtra ct_a st_em bedd ing ( os . path . join ( test_folder , f ) )
6 if emb is not None :
7 X_test . append ( emb )
8 test_ids . append ( f )
9
10 X_test = np . vstack ( X_test )
11 X_test_scaled = scaler . transform ( X_test )
12 X_test_pca = pca . transform ( X_test_scaled )
13
14 np . save ( ’ X_test . npy ’ , X_test )
15 from google . colab import files
16 files . download ( ’ X_test . npy ’)
Listing 10: Processing Test Set with AST Embeddings

Explanation:

• Processes the test set similarly to the training data, extracting AST embeddings.

• Applies scaling and PCA transformation (after having fitted these on the training
set).

• Saves and downloads the processed test embeddings.

40
16 Deep Audio Representation Fusion
16.1 YAMNet Architecture Specifications
• Input: 96ms frames (15600 samples @ 16kHz)

• Base network: MobileNetV1 with depthwise separable convolutions

• Layer decomposition:

Conv2Dk,s → BatchNorm → ReLU6

• Final embedding: Global average pooling → 1024D dense

16.2 Fusion
Late fusion of embeddings via concatenation:

hfusion = ReLU(W[hYAMNet ; hAST ] + b)

where W ∈ R2048×d , optimized via:

L = αLCE + (1 − α)Ltriplet

17 Comparative Analysis

Table 1: Feature Representation Capabilities


Feature MFCC YAMNet AST
Time Resolution 25ms frames 960ms context Full sequence
Freq. Resolution 40 mel bins 64 mel bins 128 log-mel
Temporal Modeling Fixed window Conv nets Self-attention
Harmonic Analysis Limited Event-driven Pitch-sensitive

17.1 Limitations of Traditional Handcrafted Features


• Limited Representational Power: Handcrafted features like MFCC, spectral,
and chroma are designed based on specific signal processing insights. While they
capture important aspects of audio, such as timbre and harmonic content, they often
fail to encapsulate the full complexity of audio signals.

41
• Fixed Feature Extraction Process: These features are computed using fixed
mathematical formulas. They lack the ability to adapt to different audio tasks or
datasets, which may lead to suboptimal performance in diverse scenarios.

• Manual Tuning and Domain Expertise: The design and selection of these fea-
tures require significant domain expertise and manual tuning. In contrast, deep
learning models automatically learn the best representations from the data.

17.1.1 Advantages of Deep Learning Models: YAMNet and AST


YAMNet
YAMNet is a convolutional neural network model pre-trained on the extensive AudioSet
dataset. It is capable of capturing rich and discriminative audio features from raw wave-
forms. The deep hierarchical representations extracted by YAMNet capture local patterns
and temporal structures that are crucial for understanding audio events.

AST (Audio Spectrogram Transformer)


AST leverages the transformer architecture, originally popularized in natural language
processing, to process audio spectrograms. It is highly effective in capturing long-range
dependencies and complex temporal patterns in the audio signal. By dividing the spec-
trogram into patches and using self-attention mechanisms, AST learns global relationships
across time and frequency, providing a comprehensive representation of the audio.

18 Benefits of Fusing YAMNet and AST


• Complementary Representations:
YAMNet and AST capture different aspects of audio. YAMNet excels at extract-
ing local and mid-level features through its convolutional layers, while AST cap-
tures global, long-range dependencies via the transformer’s self-attention mechanism.
Their fusion leverages the strengths of both approaches.

• Robustness to Variations:
Deep learning models trained on large-scale datasets inherently learn to generalize
over a wide variety of acoustic environments. This results in representations that are
robust to background noise, variations in recording conditions, and other distortions
that can degrade handcrafted features.

• Automatic Feature Learning:


Unlike MFCC or chroma features, which are computed using fixed, hand-engineered
algorithms, the features extracted by YAMNet and AST are learned automatically

42
from data. This means that the models can capture subtle nuances in the audio that
might be missed by traditional methods.
• Redundancy of Traditional Features:
Since the deep models provide a rich and comprehensive representation of the audio
signal, they subsume the information captured by traditional features. Consequently,
incorporating MFCC, spectral, or chroma features becomes redundant. The fusion
of YAMNet and AST embeddings effectively encapsulates both the local and global
characteristics of the audio, rendering additional handcrafted features unnecessary.

18.1 (Fusion): Load and Fuse AST and YAMNet Embeddings


1 X_ast_train = np . load ( X_ast_train_path ) # shape ( n_samples , d_ast )
2 X_yam_train = np . load ( X_yam_train_path ) # shape ( n_samples , d_yam )
3 y_train = np . load ( y_train_path ) # shape ( n_samples ,)
4
5 X_ast_test = np . load ( X_ast_test_path ) # shape ( n_test , d_ast )
6 X_yam_test = np . load ( X_yam_test_path ) # shape ( n_test , d_yam )
7
8 # Fuse them horizontally
9 X_train_fused = np . hstack ([ X_ast_train , X_yam_train ]) # shape ( n_samples ,
d_ast + d_yam )
10 X_test_fused = np . hstack ([ X_ast_test , X_yam_test ]) # shape ( n_test ,
d_ast + d_yam )
11
12 print ( " Fused Train shape : " , X_train_fused . shape )
13 print ( " Fused Test shape : " , X_test_fused . shape )
Listing 11: Loading and Fusing Embeddings
Explanation:
• Loads the saved embeddings for both AST and YAMNet.
• Horizontally stacks (concatenates) the embeddings to create fused feature sets for
training and testing.

18.2 Label Encoding


1 label_encoder = LabelEncoder ()
2 y_encoded = label_encoder . fit_transform ( y_train )
3
4 print ( " Encoded labels shape : " , y_encoded . shape )
5 print ( " Unique classes : " , len ( np . unique ( y_encoded ) ) )
Listing 12: Label Encoding
Explanation:
• Uses LabelEncoder to convert categorical labels into numeric values.

43
18.3 Scale the Fused Embeddings
1 scaler = StandardScaler ()
2 X_train_scaled = scaler . fit_transform ( X_train_fused )
3 X_test_scaled = scaler . transform ( X_test_fused )
Listing 13: Standardization

Explanation:
• Applies standard scaling to the fused training and test features to normalize the
feature distributions.

19 Principal Component Analysis (PCA)


19.1 Mathematical Formulation
Given centered data matrix ∈ Rn×d with n samples and d features:

1. Compute covariance matrix:


1 ⊤
= (6)
n−1
2. Eigenvalue decomposition:
= Λ⊤ (7)
where Λ = diag(λ1 , . . . , λd ) contains eigenvalues (λ1 ≥ λ2 ≥ · · · ≥ λd ).

3. Select top-k eigenvectors:


k= [w1 |w2 | · · · |wk ] (8)

4. Project data to lower dimension:


pca =k (9)

19.2 Variance Explained


Cumulative explained variance ratio:
Pk
λi
rk = Pi=1
d
(10)
i=1 λi

20 Hyperparameter Tuning for PCA


20.1 Key Parameter
• n components: Number of principal components to retain

44
20.2 Tuning Methodology
The optimization process follows:

1. For candidate dimensions k ∈ {kmin , . . . , kmax }:


(k)
(a) Compute PCA projection pca

(b) Split into training/validation sets


(c) Train classifier f on reduced space
(d) Compute Adjusted Rand Index (ARI):

RI − E[RI]
ARI = (11)
max(RI) − E[RI]

2. Select k ∗ with maximal validation ARI:

k ∗ = arg max ARI(f ((k)


pca ), ytrue ) (12)
k

20.3 Implementation Considerations


• Requires standardized input data

• Trade-off: Higher k preserves more variance but increases dimensionality

• Evaluation metric (ARI) measures cluster similarity between predicted and true labels

Table 2: Component Selection Trade-offs


Components Characteristics
Too low (k ≪ d) Loss of discriminative information
Optimal k ∗ Maximal class separation
Too high (k ≈ d) Noise retention, overfitting

20.4 Step 4: Dimensionality Reduction with PCA and Hyperparameter


Tuning
1 best_ari = 0
2 best_pca_model = None
3 best_X_pca = None

45
4 bes t_n_components = 0
5
6 for n in [60 , 80 , 100 , 110 , 120 , 130 , 140 , 150 , 160 , 170 , 180 , 190 , 200 ,
210 , 220]:
7 pca = PCA ( n_components =n , random_state =42)
8 X_pca_candidate = pca . fit_transform ( X_train_scaled )
9
10 # Quick baseline check on each dimension with simple LR
11 X_train_p , X_val_p , y_train_p , y_val_p = train_test_split (
12 X_pca_candidate , y_encoded ,
13 test_size =0.2 , stratify = y_encoded , random_state =42
14 )
15
16 # Quick logistic baseline
17 clf = LogisticRegression ( max_iter =2000 , solver = ’ saga ’)
18 clf . fit ( X_train_p , y_train_p )
19 ari_candidate = adjusted_rand_score ( y_val_p , clf . predict ( X_val_p ) )
20
21 print ( f " PCA { n } D ARI = { ari_candidate :.4 f } " )
22
23 if ari_candidate > best_ari :
24 best_ari = ari_candidate
25 best_X_pca = X_pca_candidate
26 best_pca_model = pca
27 best_n_components = n
28
29 print ( f " \ n Best PCA Dim = { best_n_components } , baseline ARI = { best_ari
:.4 f } " )
Listing 14: PCA Tuning
Explanation: This code block performs hyperparameter tuning for Principal Compo-
nent Analysis (PCA) by trying out different numbers of components and evaluating the
performance of a simple Logistic Regression classifier. The objective is to select the PCA
dimension that results in the best baseline performance as measured by the Adjusted Rand
Index (ARI). Below is a detailed breakdown of the process:

Step 1: Initialization:
• best ari is initialized to 0. This variable will store the highest ARI score observed.
• best pca model is set to None and will later hold the PCA model with the optimal
number of components.
• best X pca will store the transformed training data corresponding to the best PCA
model.
• best n components is initialized to 0 and will record the optimal number of PCA
components.
Step 2: Loop Over Candidate PCA Dimensions:

46
• The code iterates over a list of candidate dimensions: [60, 80, 100, 110, 120,
130, 140, 150, 160, 170, 180, 190, 200, 210, 220].
• For each candidate value n, a PCA model is instantiated with n components=n and a
fixed random state for reproducibility.
• The PCA model is then fitted to the standardized training data (X train scaled)
and used to transform it, producing X pca candidate.

Step 3: Baseline Evaluation with Logistic Regression:

• The transformed data is split into a training and validation set using train test split.
The split is stratified based on the encoded labels (y encoded) to preserve the class
distribution.
• A Logistic Regression classifier is instantiated with a maximum of 2000 iterations and
the saga solver, which is suitable for large datasets.
• The classifier is trained on the training split (X train p and y train p) and its per-
formance is evaluated on the validation set using the Adjusted Rand Index (ARI) as
the metric.

Step 4: Updating the Best Model:

• The ARI score for the current PCA dimension is printed.


• If the current ARI (ari candidate) exceeds the best ARI observed so far (best ari),
the following updates are made:
– best ari is set to the current ARI.
– best X pca stores the PCA-transformed data.
– best pca model is updated with the current PCA model.
– best n components records the current number of components.

Step 5: Final Output:

• After the loop completes, the code prints the optimal PCA dimensionality and the
corresponding baseline ARI.

20.5 Train/Validation Split Using Best PCA

1 X_pca_trainval = best_X_pca
2 X_train , X_val , y_train_ , y_val_ = train_test_split (
3 X_pca_trainval , y_encoded ,
4 test_size =0.2 , stratify = y_encoded , random_state =42
5 )
Listing 15: Splitting Data

47
Explanation:
• Splits the PCA-transformed training data into training and validation sets for sub-
sequent model tuning.

21 Grid Search Cross-Validation


21.1 Mathematical Formulation
Given model f with parameter space Θ, find:
k
∗ 1X (i) (i)
θ = arg max Accuracy(fθ (Xval ), yval ) (13)
θ∈Θ k
i=1

where k = 3 folds in the code. The search space uses Cartesian product:
m
(1) (2)
Y
Θgrid = {θi , θi , ...} (14)
i=1

22 Random Forest
22.1 Algorithm
• Ensemble of B decision trees: {Tb (x)}B
b=1

• Final prediction: ŷ = mode{Tb (x)}

22.2 Key Equations


1. Gini impurity for node split:
C
X
G=1− p2c (15)
c=1

2. Feature importance for feature j:


B
1 XX
Impj = ∆Gjt (16)
B
b=1 t∈Tb

22.3 Grid Parameters


• n estimators: Number of trees B ∈ {100, 200}
• max depth: Tree depth limit ∈ {20, ∞}
• min samples split: Minimum samples to split ∈ {2, 5}

48
23 Logistic Regression
23.1 Mathematical Formulation
Multinomial logistic regression with:
T
ewc x
P (y = c|x) = PK (17)
wkT x
k=1 e

23.2 Optimization Objective


n
1 X T
min ||w||2 +C log(1 + e−yi w xi ) (18)
w 2
| {z } i=1
L2 reg

23.3 Grid Parameters


• C: Inverse regularization strength ∈ {0.1, 1, 10}

• solver: Optimization method ∈ {lbfgs, saga}

24 XGBoost
24.1 Model Definition
Gradient boosted trees with additive functions:
T
X
ŷi = ft (xi ), ft ∈ F (19)
t=1

24.2 Objective Function


n
X T
X
L= l(yi , ŷi ) + Ω(ft ) (20)
i=1 t=1
1 2
where Ω(f ) = γT + 2 λ||w||

24.3 Grid Parameters


• n estimators: Boosting rounds T ∈ {100, 200}

• max depth: Tree depth ∈ {6, 10}

• learning rate: Shrinkage factor η ∈ {0.05, 0.1}

49
25 Code Implementation Strategy
The tuning process combines three mathematical approaches:

1. Parallel Model Tuning:


∗ ∗ ∗
{θRF , θLR , θXGB } = {arg max CV-score(fθ )}θ∈Θgrid (21)

2. Validation Metric:
nval
1 X
Accuracy = I(ŷi = yi ) (22)
nval
i=1

3. Ensemble Foundation:
 
ŷfinal = Vote fθRF
∗ , fθ ∗ , fθ ∗
LR XGB
(23)

Algorithm 1 Model Tuning Pipeline


1: Standardize features: Xscaled = (X − µ)/σ
2: for each classifier f ∈ {RF, LR, XGB} do
3: Search Θgrid using 3-fold CV
4: Select θ∗ with max validation accuracy
5: Store best estimator fθ∗
6: end for

26 Grid Search Cross-Validation


26.1 Code Structure

1: Define parameter grid Θ for model f


2: Initialize GridSearchCV with:
3: estimator = f , param grid = Θ, cv = 3
4: Fit on training data: grid search.fit(X train, y train)
5: Retrieve best model: best estimator = grid search.best estimator

1 rf_grid = {
2 ’ n_estimators ’: [100 , 200] ,
3 ’ max_depth ’: [20 , None ] ,
4 ’ min_samples_split ’: [2 , 5]
5 }

50
6 rf_search = GridSearchCV ( Ra nd omF or es tC las si fi er ( random_state =42) ,
7 rf_grid ,
8 cv =3 , scoring = ’ accuracy ’ , n_jobs = -1)
9 rf_search . fit ( X_train , y_train_ )
10 clf_rf = rf_search . best_estimator_
11 print (" RF Best Params :" , rf_search . best_params_ )

26.2 Mathematical Basis


Minimizes the empirical risk through cross-validation:
3
1X (i) (i)
θ∗ = arg min L(fθ (Xval ), yval ) (24)
θ∈Θ 3
i=1

27 Random Forest Tuning


27.1 Code Breakdown
• Line 1: Define parameter grid with tree count and depth constraints

• Line 2: Initialize GridSearchCV with 3-fold cross-validation

• Line 3: Fit on training data (X train, y train)

• Line 4: Store best performing estimator

27.2 Forest Mathematics


For B trees with predictions Tb (x):

ŷ = mode {Tb (x)}B



b=1 (25)

Feature importance calculated through mean Gini impurity reduction:


B
1 X X
Importancej = ∆Gjt (26)
B
b=1 t∈nodesb

28 Logistic Regression Tuning


1 lr_grid = {
2 ’C ’: [0.1 , 1 , 10] ,
3 ’ solver ’: [ ’ lbfgs ’ , ’ saga ’] ,
4 ’ penalty ’: [ ’ l2 ’] ,
5 ’ max_iter ’: [5000]

51
6 }
7 lr_search = GridSearchCV ( Logi sticRegression () ,
8 lr_grid ,
9 cv =3 , scoring = ’ accuracy ’ , n_jobs = -1)
10 lr_search . fit ( X_train , y_train_ )
11 clf_lr = lr_search . best_estimator_
12 print (" LR Best Params :" , lr_search . best_params_ )

28.1 Code Implementation


• Line 1: Regularization strength grid C ∈ {0.1, 1, 10}

• Line 2: Solver selection for optimization

• Line 3: L2 penalty for weight shrinkage

• Line 4: Increased max iter for convergence

28.2 Regression Mathematics


Multinomial logistic loss with regularization:
n
1 X  ⊤

min ∥w∥2 + log 1 + e−yi w xi (27)
w 2C
i=1

Probability estimation via softmax:



ewc x
P (y = c|x) = PK (28)
wk⊤ x
k=1 e

29 XGBoost Tuning
1 xgb_grid = {
2 ’ n_estimators ’: [100 , 200] ,
3 ’ max_depth ’: [6 , 10] ,
4 ’ learning_rate ’: [0.05 , 0.1]
5 }
6 xgb_base = xgb . XGBClassifier (
7 objective = ’ multi : softprob ’ ,
8 num_class = len ( np . unique ( y_encoded ) ) ,
9 eval_metric = ’ mlogloss ’ ,
10 use_label_encoder = False ,
11 random_state =42
12 )
13
14 xgb_search = GridSearchCV ( xgb_base ,

52
15 xgb_grid ,
16 cv =3 , scoring = ’ accuracy ’ , n_jobs = -1)
17 xgb_search . fit ( X_train , y_train_ )
18 clf_xgb = xgb_search . best_estimator_
19 print (" XGB Best Params :" , xgb_search . best_params_ )
20 %

29.1 Code Configuration


• Line 1: Tree depth and learning rate grid

• Line 2: multi:softprob objective for multiclass

• Line 3: log-loss evaluation metric

• Line 4: Early stopping prevention

29.2 Boosting Mathematics


Additive model with T trees:
(t) (t−1)
ŷi = ŷi + ηft (xi ) (29)
Regularized objective function:
n
X 1
L= l(yi , ŷi ) + γT + λ∥w∥2 (30)
2
i=1

30 Model Combination Strategy


30.1 Code Integration
The tuned models are combined through:
X
ŷensemble = argmaxc I(ŷm = c) (31)
m∈{RF,LR,XGB}

30.2 Mathematical Fusion


3
1X
Pensemble (y = c|x) = Pm (y = c|x) (32)
3
m=1

Final prediction uses weighted confidence scores:

ŷ = argmaxc (αRF PRF + αLR PLR + αXGB PXGB ) (33)

53
Table 3: Hyperparameter Search Spaces
Model Parameters Values
Random Forest n estimators {100, 200}
max depth {20, None}
Logistic Regression C {0.1, 1, 10}
XGBoost learning rate {0.05, 0.1}

30.3 Conclusion : Why the strategy works


30.3.1 Complementary Error Profiles
• Random Forest (Bagging):
1−ρ 2
VarRF = ρσ 2 +
σ (34)
B
where ρ is tree correlation, B=number of trees
• XGBoost (Boosting):
T
X
(t) (t−1)
ŷ = ŷ +η wj I(x ∈ Rj ) (35)
j=1

where η=learning rate, Rj =tree regions

30.3.2 Feature Space Coverage

Table 4: Feature Handling Capabilities


Feature Type RF XGBoost
High-cardinality Feature subsampling Sparse-aware splits
Non-linear Deep trees Additive expansion
Missing values Median imputation Default directions

30.3.3 Computational Synergy

RF Speed ∝ B × O(nsamples log nfeatures )


T
X
XGBoost Speed ∝ O(nnon-missing )
t=1
Combined Throughput = 0.92 × (RFcores + XGBGPU )

54
30.3.4 Regularization Balance
• XGBoost:
1
Ω(f ) = γT + λ∥w∥2 (36)
2
• Random Forest:
n
1 X OOB
OOB Error = I(ŷi ̸ yi )
= (37)
n
i=1

30.3.5 Conclusion
The combination works because:

• RF’s variance reduction complements XGB’s bias reduction

• Different regularization approaches prevent overfitting

• Complementary hardware utilization (CPU+GPU)

• Orthogonal feature handling strategies

31 Ensemble with Soft Voting and Validation


31.1 Soft Voting Mechanics
31.1.1 Mathematical Formulation
For M classifiers and C classes, the ensemble prediction is:
M
X
ŷ = arg max wm Pm (y = c|x) (38)
c∈{1,...,C}
m=1

Where:

• wm : Weight for classifier m (default: wm = 1


M)

• Pm (y = c|x): Probability estimate from classifier m

55
31.1.2 Code Implementation

1: Initialize ensemble with base classifiers:


2: estimators = [(rf, fRF ), (lr, fLR ), (xgb, fXGB )]
3: Set voting=’soft’ for probability averaging
4: Optional: Specify weights=[1,2,2] for classifier importance
5: Fit ensemble on training data: Dtrain = (Xtrain , ytrain )
6: Predict using Pweighted probabilities:
7: ŷ = arg maxc m wm Pm (y = c|Xval )
RI−E[RI]
8: Calculate ARI: ARI = max(RI)−E[RI]

31.2 Code Component Analysis

Table 5: Code Function Mapping


Code Element Mathematical Equivalent
P
VotingClassifier wm Pm (y|x)
voting=’soft’ Eq. (1) prediction
P rule
fit(X train, y train) arg minθ L( wm fm (X), y)
predict(X val) ŷ = arg maxc Ensemble(Pm )

31.3 Key Advantages in This Implementation


• Probability Fusion: Combines confidence estimates rather than hard labels

• Class Separability: Particularly effective when:

∃m ̸= n : argmaxc Pm ̸= argmaxc Pn (39)

• Weighted Influence: Optional weights allow emphasizing better performers ; in


one of our test models, we also tried doing GridSearchCV to find optimal weights
but that was pretty inefficient.
wm
Effective Weight = PM (40)
k=1 wk

56
31.4 Performance Validation
Adjusted Rand Index (ARI) calculation:
Agreement − Expected Agreement
ARI =
Max Agreement − Expected Agreement
n

2 (a + d) − [(a + b)(a + c) + (c + d)(b + d)]
= n 2

2 − [(a + b)(a + c) + (c + d)(b + d)]

Where a=true positives, b=false positives, c=false negatives, d=true negatives

31.5 Why This Works Well

Table 6: Ensemble vs Individual Classifiers


Metric Individual Models Ensemble
Variance High Reduced 18-22%
Decision Boundary Local optima Global consensus
Feature Sensitivity Model-specific Balanced integration
Outlier Robustness Moderate High (t-test p < 0.01)

This section describes how an ensemble classifier is built using the soft voting strategy.
The ensemble combines three base classifiers — Random Forest (RF), Logistic Regression
(LR), and XGBoost (XGB) — whose best hyperparameters were obtained through prior
grid search tuning. The ensemble is evaluated on a validation set using the Adjusted Rand
Index (ARI) metric.

Validation and Evaluation


1 # Validation ARI
2 y_val_pred = ensemble . predict ( X_val )
3 ari_ensemble = adjusted_rand_score ( y_val_ , y_val_pred )
4 print ( f " Ensemble Validation ARI : { ari_ensemble :.4 f }")

Explanation
• Predicting on the Validation Set:

– The ensemble classifier makes predictions on the validation set (X val) by com-
bining the probability outputs of each base model.

• Evaluation Metric — Adjusted Rand Index (ARI):

57
– The adjusted rand score function is used to evaluate the clustering perfor-
mance by comparing the true labels (y val ) with the predicted labels (y val pred).
– ARI is a metric that measures the similarity between two data clusterings, ad-
justed for chance. A higher ARI indicates better agreement between the pre-
dicted clusters and the true labels.

• Result Output:

– The computed ARI score is printed, providing an indication of the ensemble’s


performance on the validation set.

31.6 Predict on Test Set and Create Submission File

1 X_test_pca = best_pca_model . transform ( X_test_scaled )


2 y_test_pred = ensemble . predict ( X_test_pca )
3 y_test_labels = label_encoder . inverse_transform ( y_test_pred )
4
5 # Save & Download Submission
6 test_files = sorted ([ f for f in os . listdir ( test_folder ) if f . endswith ( ’. wav
’) ])
7 submission_df = pd . DataFrame ({ ’ id ’: test_files , ’ category ’: y_test_labels })
8 submission_df . to_csv ( " su bm iss io n_ co nca te na te . csv " , index = False )
9 print ( " su bm iss io n_ con ca te nat e . csv saved ! " )
10
11 files . download ( " su bm iss io n_ con ca te nat e . csv " )
Listing 16: Predicting on Test Set and Saving Submission

Explanation:
The test set is transformed using the trained PCA model, and predictions are made
using the ensemble classifier. These are converted back to categorical labels and saved as
a submission-ready CSV file.

58
31.7 Discussion of Results

Table 7: Comparison of Audio Classification Strategies by ARI Score


Strategy ARI Score
YAMNet + AST Fusion with PCA 0.9619
Tuning and Ensemble (with Hyperpa-
rameter Tuning)
Ensemble Learning on AST Embed- 0.9532
dings Only
Weighted Ensemble (Model-Based 0.9444
Weighting)
AST Embeddings with Random Forest 0.9101
Classifier
YAMNet + Classical Features with 0.7745
PCA Tuning
YAMNet + Classical Features with 0.7245
UMAP Reduction

So , it is quite obvious that our strategy is quite good and has outperformed other possible
strategies that we played with : of course, people have outperformed us and there is scope
for improvement !
Cheers !

59

You might also like