
Multisense GPT, a Question Answering System
A DISSERTATION

submitted in partial fulfillment of the


requirements for the award of the degree of

Master of Science
in
Computer Science

by

Chaiti Paul
Student roll: 22332305

Deep Das
Student roll: 22332306

Under the supervision of

Prof. Debjani Bhattacharjee

DEPARTMENT OF COMPUTER SCIENCE


ACHARYA PRAFULLA CHANDRA COLLEGE
WEST BENGAL STATE UNIVERSITY
July 2024
DECLARATION

We, Chaiti Paul and Deep Das, declare that the work entitled Multisense GPT, a Question Answering System presented in this thesis is original and has been done by ourselves under the supervision of Prof. Debjani Bhattacharjee. The work has not been submitted for a degree elsewhere, in part or in full. Further, due credit has been attributed to the relevant state-of-the-art collaborations with appropriate citations and acknowledgments, in line with established norms and practices.

NAME: CHAITI PAUL
ROLL NO: 22342405
SIGNATURE:

NAME: DEEP DAS
ROLL NO: 22342406
SIGNATURE:
CERTIFICATE

This is to certify that the work entitled Multisense GPT, a Question Answering System, submitted by Chaiti Paul and Deep Das, has been carried out under my supervision for the partial fulfillment of the requirements for the award of the Master of Science in Computer Science.

It is understood that by this approval the undersigned do not necessarily endorse any of the statements made or opinions expressed therein but approve it only for the purpose for which it is submitted.

SIGNATURE OF SUPERVISOR                SIGNATURE OF THE EXTERNAL EXAMINER/S

PROF. DEBJANI BHATTACHARJEE
ACKNOWLEDGMENTS

First, I express my gratitude to the Almighty, who blessed me with the zeal and enthusiasm to complete this research work successfully. I am extremely thankful to my supervisor Prof. Debjani Bhattacharjee, Department of Computer Science, Acharya Prafulla Chandra College, New Barrackpore, Kolkata, for their motivation and tireless efforts to help me gain deep knowledge of the research area and for supporting me throughout the life cycle of my M.Sc. dissertation work. Especially, the extensive comments, healthy discussions, and fruitful interactions with the supervisor directly impacted the final form and quality of my M.Sc. dissertation work.

I also thank Prof. Kunal Das, Head of the Computer Science Department, for their fruitful guidance through the early years of chaos and confusion. I wish to thank the faculty members and supporting staff of the Computer Science Department for their full support and heartiest cooperation.

This thesis would not have been possible without the hearty support of my friends. My deepest regards to my Parents for their blessings, affection, and continuous support. Last but not least, I thank GOD, the Almighty, for giving me the inner willingness, strength, and wisdom to carry out this research work successfully.
ABSTRACT

A Question Answering System (QAS) is an information retrieval system in which a direct answer is expected in response to a submitted query, rather than a set of references that may contain the answers. The QAS is concerned with providing relevant answers in response to questions posed in natural language. This paper focuses on the design and implementation of an open-domain, interactive QA system. The proposed system is explained and demonstrated with detailed examples. The language processing portion of the system uses several Large Language Models (LLMs) to judge the meaningfulness of questions, generating dialogue to clarify the response. The system is specifically designed to effectively handle and provide responses to inquiries that encompass textual, visual, and auditory content.

Keywords: Natural Language Processing (NLP), Question-Answering System (QAS), Large Language Model (LLM), Question Processing, Answer Extraction
Table of Contents

1 Introduction 1

2 Literature Review 3
2.1 LLaVA...................................................................................................................................3
2.2 ggml-model-q5-k..............................................................................................................3
2.3 mmproj-model-f16...........................................................................................................3
2.4 Mistral-7b-instruct-v0.1.Q5-K-M.......................................................................................3
2.5 Whisper AI...........................................................................................................................4
2.6 GPT-2....................................................................................................................................4
2.7 Llama-2-13B-chat-GGML...................................................................................................4

3 Methodology 5

Working model 5

Features 7
3.1 Knowledge Base-Driven Responses Without Input.....................................................7
3.2 PDF Analysis and Question Answering...........................................................................8
3.3 Photo Analysis and Question Answering.....................................................................10
3.4 Video Analysis and Question Answering.....................................................................13
3.5 Session Management....................................................................................................18
3.6 Voice Commands...........................................................................................................19
3.7 Deleting Current Chat Session and Clearing Cache.................................................20

4 Conclusion 21

Bibliography 22

1 Introduction
The proliferation of internet usage in conjunction with the remarkable expansion of data storage capacity has granted us the capability to both securely archive and effectively disseminate data to the public. However, sifting through this vast amount of data has made finding information time-consuming and complex. As a result, there has been a push to develop new research tools, such as Question Answering Systems, to address this challenge. Question answering systems represent a noteworthy advancement in information retrieval technologies, especially in their ability to access knowledge resources naturally by querying and retrieving the right replies in succinct words.

Since the 1960s, numerous QA Systems (QAS) have been developed to respond to user inquiries through various approaches. These systems have addressed a wide range of domains, databases, question types, and answer structures. Modern methods involve retrieving and analyzing data from multiple sources to effectively respond to questions presented in natural language. A Question Answering (QA)[1] system is a multidisciplinary research area that encompasses Information Retrieval (IR), Information Extraction (IE), and Natural Language Processing (NLP). The objective of a question-answering system is to provide answers to queries rather than overwhelming users with full documents or the most pertinent passages, which is the typical functionality of most information retrieval systems. For example, for the question "Who is the first prime minister of India?", the exact answer expected by the user is "Pandit Jawaharlal Nehru"; the user does not intend to read through passages or documents that merely match words like first, prime minister, India, etc.

The question-answering system[5] is continually advancing and improving to meet the challenges and opportunities in this field through the integration of new trends and innovations. From the initial text-based programs to the current AI-powered question-answering systems, virtual assistants and chatbots have emerged as sophisticated communication tools, offering automation and customer support capabilities. Numerous businesses utilize these systems on their websites and social media accounts to offer round-the-clock customer support without the need for a large team of human agents. Advanced chatbots are also employed in the healthcare, finance, and education sectors to deliver personalized assistance and support. Additionally, some chatbots can collect user data, which can be used to improve marketing campaigns and personalize the customer experience. One of the renowned chatbots in the current market, ChatGPT, is a sophisticated language model designed to enhance the engagement and informativeness of chatbots. Trained on an extensive dataset comprising text and code, ChatGPT possesses the capacity to comprehend and produce human-like text. However, despite their advanced capabilities, these systems are not without limitations. When used as a QAS without any text for answer extraction, systems like ChatGPT must rely on their own knowledge to generate answers, which can be strongly influenced by training bias. Also, these systems cannot interact with audio/video data or PDFs.

We attempted to integrate these types of data into our system. Our system can process audio, PDFs, and images to provide human-like responses to user queries. Without context to generate responses from, it uses its knowledge base to provide an answer. We have also incorporated a voice feature into our system to offer a dependable and user-friendly environment. Furthermore, we store users' previous conversation sessions along with timestamps in the system database, enabling users to access and review them anytime.

2 Literature Review
We selected the models based on their state-of-the-art performance in various NLP tasks and their availability through trusted providers. The models were chosen to meet our performance requirements, handle our data efficiently, and integrate seamlessly into our system architecture. The models that we have utilized to implement our system are LLaVA, ggml-model-q5-k, mmproj-model-f16, Mistral-7b-instruct-v0.1.Q5-K-M, Whisper AI, GPT-2, and Llama-2-13B-chat-GGML. A few studies on these models are summarized below.

2.1 LLaVA
LLaVA (Large Language and Vision Assistant)[4] is an open-source chatbot, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, achieving impressive chat capabilities. It is designed to understand and generate content based on visual inputs (images) and textual instructions.

2.2 ggml-model-q5-k
The ggml-model-q5-k is utilized to produce responses for queries related to PDFs. It is a file
structure in llava-v1.5-13b[3].

2.3 mmproj-model-f16
The mmproj-model-f16 is utilized to produce responses for queries related to images. It
is a file structure in llava-v1.5-7b.

2.4 Mistral-7b-instruct-v0.1.Q5-K-M
The Mistral-7B-Instruct-v0.1[2] Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.1 generative text model using a variety of publicly available conversation datasets.


2.5 Whisper AI
Whisper[6] is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. Whisper is a transformer-based encoder-decoder model. It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio. Whisper performs the task of speech transcription, where the source audio language is the same as the target text language.

2.6 GPT-2
The GPT-2[7] model was introduced in the paper "Language Models are Unsupervised Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever from OpenAI. It is a causal (unidirectional) transformer that was pretrained using language modeling on a large dataset of approximately 40 GB of text. GPT-2 is a large transformer-based language model with 1.5 billion parameters, and it was trained on a dataset consisting of 8 million web pages.

2.7 Llama-2-13B-chat-GGML
Llama 2[8] is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pre-trained and fine-tuned variations.

Llama 2-Chat is a family of fine-tuned Llama-2 models that are optimized for dialogue use cases. These models are specifically designed to generate human-like responses to natural language input, making them suitable for chatbot and conversational AI applications.

Llama-2-13B-chat is a 13 billion parameter model and is pre-trained on a large corpus of text that includes conversational data, such as chat logs and social media posts. This allows the model to learn the patterns and structures of natural language dialogue and to generate coherent and contextually appropriate responses to user input.

3 Methodology
3.1 System Overview
The proposed system for Indian Classical Instrumental
Raga Detection is a comprehensive, data-driven
architecture that combines advanced audio signal
processing and deep learning techniques. Its design is
inspired by the rich, multidimensional nature of Indian
classical music, which incorporates melody, rhythm, pitch
structures, and modal scales. The system aims to identify
the raga of a given instrumental audio clip with high
accuracy by analyzing both low-level and high-level
musical features.

Key Objectives of the System:

• To process raw audio input and extract meaningful musical features such as melodic contour, rhythmic cycles (taal), and tonal centers (tonic).
• To build a robust model that can learn from these features and classify the raga, even in instrumental contexts where vocal clues are absent.
• To support visualization and interpretability of the raga prediction process.


Major Components of the System:

1. Audio Preprocessing and Feature Extraction
The raw audio is first loaded and standardized in terms of sampling rate and duration. Then, a rich set of musical features is extracted, representing different dimensions of a raga:
• Spectral features: Mel-spectrogram, MFCC
• Melodic features: Pitch contour, pitch histogram
• Harmonic features: Chroma, Tonnetz
• Rhythmic features: Tempogram, onset envelope (Taal structure)
• Structural cues: Vadi–Samvadi, tonic estimation
This stage ensures that all essential musical aspects of the raga are captured before model training.

2. Feature Normalization and Alignment
Since the extracted features vary in length, scale, and dimensionality, each feature is normalized:
• Pitch contours are interpolated to fixed-length vectors.
• Energy-based features are scaled using min-max or z-score normalization.
• Feature tensors are then prepared for input into a deep learning model.
This step guarantees consistency and comparability across different input samples.

3. Multi-Branch Deep Learning Architecture
Given the heterogeneity of the input features, the system employs a multi-branch neural network architecture:
• Each feature (e.g., mel-spectrogram, pitch contour, MFCC) is passed through a dedicated sub-network (CNN, GRU, or dense layer).
• These branches learn to encode specific information:
  o CNNs extract local patterns from time-frequency maps
  o GRUs capture sequential dependencies in pitch/melody
  o Dense layers compress statistical features
Each branch outputs a latent representation or embedding of its input feature.
4. Feature Fusion and Classification
The feature embeddings from all branches are concatenated into a single comprehensive feature vector. This is passed through one or more fully connected layers with non-linear activation functions and dropout for regularization. The final layer uses a softmax function to output the predicted probability distribution over all possible raga classes.

5. Model Training and Evaluation
The model is trained using labeled audio samples and supervised learning:
• Loss Function: Cross-entropy loss
• Optimizer: Adam
• Metrics: Accuracy, precision, recall, F1-score, and confusion matrix
• Training Techniques: Data augmentation (e.g., pitch shifting, time stretching) is used to prevent overfitting and improve generalization.

6. Raga Prediction and Visualization
At inference time, the system:
• Accepts a new audio clip
• Extracts and processes features in the same way as during training
• Uses the trained model to predict the raga
• Visualizes:
  o Mel-spectrogram
  o Waveform
  o Predicted raga

System Strengths
• Tonic-invariant: Uses tonic-normalized pitch features.
• Multi-dimensional: Considers melody, rhythm, and harmony simultaneously.
• Modular: Easily extendable for new features or musical styles.
• Instrument-agnostic: Designed specifically for instrumental music, which lacks the syllabic markers present in vocals.


3.2 Data Collection and Preprocessing


An essential step in building any machine learning
system is the careful preparation of the dataset. In the
context of Indian classical music, particularly
instrumental raga detection, it becomes even more
critical due to the diversity in performance styles,
instruments, recording qualities, and tonal references.
This section outlines the process followed for collecting,
organizing, and preprocessing the audio data used in our
model.

3.2.1 Data Collection

The dataset used in this project consists of instrumental audio recordings of Indian classical music, with each recording labelled according to its corresponding raga. The data was curated from a variety of open-access sources, digital music archives, and personal collections of classical performances. The selection focused solely on non-vocal instrumental compositions, ensuring the absence of lyrics or vocal ornamentations that could otherwise influence pitch detection and feature extraction.

Sources of Audio Data (limited dataset)

• Kaggle
• YouTube

Each audio clip is associated with metadata including:

• Raga name (label)
• Duration

Data Organization
The audio files were organized in a folder hierarchy such as:
Data/
  raga bhairav/      (.wav files)
  raga bhoop/        (.wav files)
  raga madhuvanti/   (.wav files)
  raga shudh sarang/ (.wav files)
  raga yaman/        (.wav files)

3.2.2 Preprocessing Pipeline


To ensure consistency and compatibility with the model's
input requirements, the following preprocessing steps were
applied to all audio files:

1. Resampling
All audio signals were resampled to a standard sampling
rate of 22,050 Hz to match the input requirements of
librosa and reduce computational overhead. This rate is
sufficient for capturing essential harmonic and melodic
content up to 11 kHz, which includes all necessary frequency
information for Indian classical instruments.

2. Mono Conversion
Recordings with stereo channels were converted to mono by
averaging the two channels. This simplifies analysis and
removes any spatial or panning information which is
irrelevant for raga recognition.

3. Duration Normalization
To maintain uniform input dimensions:
• Audio clips were trimmed to a maximum duration of X seconds
• Shorter clips were zero-padded to match the target duration
This step ensures that extracted features (e.g., mel-spectrograms, pitch contours, etc.) have consistent temporal dimensions across all samples.
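The following is a minimal sketch of this preprocessing step using librosa; the helper name and the concrete 30-second clip length are illustrative assumptions, since the report leaves the target duration as "X seconds".

import librosa
import numpy as np

TARGET_SR = 22050          # resampling rate used in the report
TARGET_SECONDS = 30        # assumed clip length; the report leaves this as "X seconds"

def load_and_standardize(path, sr=TARGET_SR, seconds=TARGET_SECONDS):
    # librosa loads as mono float32 and resamples in one call
    y, _ = librosa.load(path, sr=sr, mono=True)
    target_len = sr * seconds
    if len(y) > target_len:          # trim long clips
        y = y[:target_len]
    else:                            # zero-pad short clips
        y = np.pad(y, (0, target_len - len(y)))
    return y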


3.2.3 Labeling for Future Usage and Dataset Splitting

Each folder name corresponds to a raga class. This directory structure allowed automatic parsing of labels during training.

Train–Validation–Test Split:
Multiple splits were used for better distribution and accuracy, for example:
Example 1:
• Training set: 70%
• Validation set: 15%
• Test set: 15%
Example 2:
• Training set: 80%
• Validation set: 10%
• Test set: 10%

Stratified sampling ensured a balanced raga distribution across splits.
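A stratified 70/15/15 split of this kind can be sketched with scikit-learn as below; the Data/ folder layout follows the hierarchy shown earlier, while the two chained splits and the random seed are illustrative assumptions.

import glob, os
from sklearn.model_selection import train_test_split

# Collect (file, label) pairs; the label is simply the folder name, e.g. "raga yaman"
files = glob.glob("Data/*/*.wav")
labels = [os.path.basename(os.path.dirname(f)) for f in files]

# 70 / 15 / 15 split with stratification so every raga appears in every subset
train_f, temp_f, train_y, temp_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    temp_f, temp_y, test_size=0.50, stratify=temp_y, random_state=42)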

Note :
The preprocessing and data preparation phase ensured that the
raw audio recordings were converted into a clean, consistent,
and augmented format suitable for deep learning. Careful
attention was given to maintain musical integrity, especially in
preserving the tonic reference and melodic content that are
crucial for successful raga detection.


3.3 Feature Extraction


Feature extraction is the core component of this raga
detection system. Indian classical ragas are defined by
melodic movement, tonal structure, rhythmic patterns,
and emphasis on specific notes. Therefore, a rich set of
time-series and statistical features is extracted using the
librosa library and custom signal processing functions.
All features are extracted from a resampled, mono, and amplitude-normalized audio signal.

3.3.1 Tonic Detection

Librosa Function: librosa.piptrack(y, sr)
We extract the tonic (the base frequency corresponding to 'Sa') from the pitch salience spectrogram. The most frequently occurring pitch peak is assumed to be the tonic.

Use in System: The tonic is used to convert absolute pitch to relative pitch in cents, making features tonic-invariant, which is critical for raga modeling.
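One plausible implementation of this rule is sketched below; the histogram resolution and frequency range are assumptions made for illustration, not values taken from the report.

import librosa
import numpy as np

def estimate_tonic(y, sr=22050):
    # pitches[f, t] holds the frequency of bin f at frame t, magnitudes[f, t] its salience
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
    # Keep the most salient pitch per frame, drop unvoiced frames (0 Hz)
    idx = magnitudes.argmax(axis=0)
    frame_pitches = pitches[idx, np.arange(pitches.shape[1])]
    frame_pitches = frame_pitches[frame_pitches > 0]
    # The most frequently occurring pitch is taken as the tonic ('Sa')
    hist, edges = np.histogram(frame_pitches, bins=240, range=(50, 1000))
    tonic_hz = edges[hist.argmax()]
    return tonic_hz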


3.3.2 Mel-Spectrogram

Librosa Function: librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N)
Description: A perceptually scaled spectrogram emphasizing human auditory resolution.
Use in the system: Captures the timbral texture of the instrument and the overall spectral envelope, aiding in distinguishing between raga types and instrument articulations.

3.3.4 MFCC + Delta-MFCC

Librosa Functions: librosa.feature.mfcc(), librosa.feature.delta()
Description: Compact representation of the spectrum using cepstral coefficients. The delta captures the derivative (velocity).
Use in the system: Encodes instrument characteristics and articulation over time, which is important for distinguishing how a raga is rendered on different instruments.
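A hedged sketch of how these two features might be extracted with librosa is shown below; the choice of 128 mel bins, 13 MFCCs, and the dB conversion are illustrative assumptions.

import librosa
import numpy as np

def spectral_features(y, sr=22050, n_mels=128, n_mfcc=13):
    # Mel-spectrogram (converted to dB for a perceptually reasonable scale)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # MFCCs plus their first-order deltas, stacked feature-wise (e.g., 13 + 13 = 26 rows)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    mfcc_delta = np.vstack([mfcc, delta])
    return mel_db, mfcc_delta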

3.3.5 Chroma Features

Librosa Function: librosa.feature.chroma_stft(y, sr)
Description: Maps energy to 12 pitch classes (C, C#, D, ..., B), regardless of octave.
Equation:

Chroma_n = Σ_{f ∈ P_n} |X(f)|²

where P_n is the set of frequencies corresponding to pitch class n.
Use in the system: Represents scale structure (Arohana/Avarohana) and note emphasis, crucial for raga differentiation.

3.3.6 Tonnetz (Tonal Centroids)

Librosa Function: librosa.feature.tonnetz(y=y, sr=sr)
Description: Captures harmonic relations such as perfect fifths, minor thirds, and major thirds.
Use in the system: Helps in identifying harmonic characteristics of ragas, especially when played on harmonic-rich instruments.
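The chroma and Tonnetz features could be computed as in the sketch below; averaging over time to obtain fixed-length vectors is an assumption about how the report turns them into the statistical vectors used later.

import librosa
import numpy as np

def harmonic_features(y, sr=22050):
    # 12-bin chroma from the short-time Fourier transform
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    # Tonal centroids, computed here on the harmonic component of the signal
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    # Summarize each as a fixed-length vector by averaging over time
    return chroma.mean(axis=1), tonnetz.mean(axis=1)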

3.3.7 Pitch Histogram

Librosa Function: None (custom histogram on the pitch contour, in cents)
Description: Represents the distribution of note occurrences relative to the tonic.
Use in the system: Detects swara usage patterns and supports Vadi–Samvadi detection.


3.3.8 Rhythm Profile (Tempogram)

Librosa Functions:
• librosa.onset.onset_strength(y, sr)
• librosa.feature.tempogram(onset_envelope)

Use in the system: Models Taal structure and rhythm, which differentiate ragas sharing melodic patterns.
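A minimal sketch of the rhythm profile, under the assumption that the tempogram is averaged over time into a fixed-length vector:

import librosa

def rhythm_profile(y, sr=22050):
    # Onset strength envelope captures where note attacks occur
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    # Tempogram: local autocorrelation of the onset envelope over tempo lags
    tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)
    # Average over time to obtain a fixed-length rhythm profile vector
    return tempogram.mean(axis=1)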

3.3.9 Vadi–Samvadi Estimation

Librosa Function: None (custom logic from the pitch histogram)
Logic:
• Vadi and Samvadi are inferred as the two most prominent bins in the pitch histogram.
• Sorted using np.argsort(histogram)[-2:].
Use: Encodes the central emotional and structural swaras of the raga.
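A possible implementation of the pitch histogram and the Vadi–Samvadi rule is sketched below; folding pitches into a single octave and the 120-bin resolution are assumptions for illustration.

import numpy as np

def pitch_histogram_and_vadi(pitch_hz, tonic_hz, n_bins=120):
    # Convert absolute pitch to cents relative to the tonic, folded into one octave
    cents = 1200 * np.log2(pitch_hz[pitch_hz > 0] / tonic_hz)
    cents = np.mod(cents, 1200)
    # Histogram of note occurrences relative to 'Sa'
    hist, _ = np.histogram(cents, bins=n_bins, range=(0, 1200), density=True)
    # Vadi and Samvadi taken as the two most prominent histogram bins
    vadi_bin, samvadi_bin = np.argsort(hist)[-1], np.argsort(hist)[-2]
    return hist, vadi_bin, samvadi_bin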

3.3.10 Tonic (Hz)


Used for pitch normalization.
Extracted with pitch histogram from piptrack().


3.4 Feature Normalization and Resampling

To ensure uniformity across input samples, features are resampled and normalized.

3.4.1 Interpolation of Pitch Contour

Method: Linear interpolation using scipy.interpolate.interp1d
• Each pitch contour is interpolated to a fixed length (e.g., 216) regardless of actual duration.
• This allows the model to learn on a fixed input size.
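A minimal sketch of this interpolation step, assuming the fixed length of 216 mentioned above:

import numpy as np
from scipy.interpolate import interp1d

def resample_contour(contour, target_len=216):
    # Map the original time axis onto [0, 1] and linearly interpolate to the fixed length
    x_old = np.linspace(0.0, 1.0, num=len(contour))
    x_new = np.linspace(0.0, 1.0, num=target_len)
    return interp1d(x_old, contour, kind="linear")(x_new)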

3.4.2 Normalization Techniques

Min-Max Normalization (e.g., for pitch histogram, rhythm profile):

x' = (x − x_min) / (x_max − x_min)

Z-Score Normalization (e.g., for MFCC, mel-spectrogram):

z = (x − μ) / σ

Purpose:
• Prevents features with large values (e.g., energy-based) from dominating others.
• Accelerates training convergence in neural networks.
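These two normalizations can be written directly as small helpers; the epsilon guard against constant features is an added assumption.

import numpy as np

def min_max_normalize(x, eps=1e-8):
    # Rescales values into [0, 1]; eps guards against constant features
    return (x - x.min()) / (x.max() - x.min() + eps)

def z_score_normalize(x, eps=1e-8):
    # Zero mean, unit variance
    return (x - x.mean()) / (x.std() + eps)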


3.4.3 Padding and Truncation

• Mel-spectrograms and MFCCs are truncated or zero-padded to a fixed number of frames.
• This makes input dimensions consistent across all samples.

3.4.4 Tensor Conversion

All processed features are converted to torch.Tensor objects with a standardized dtype=torch.float32, making them ready for training.
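A combined sketch of padding/truncation and tensor conversion, assuming features are stored as (n_features, n_frames) NumPy arrays and a fixed frame count of 216:

import numpy as np
import torch

def pad_or_truncate(feat, max_frames=216):
    # feat has shape (n_features, n_frames); fix the time axis to max_frames
    n_frames = feat.shape[1]
    if n_frames >= max_frames:
        feat = feat[:, :max_frames]
    else:
        feat = np.pad(feat, ((0, 0), (0, max_frames - n_frames)))
    return torch.tensor(feat, dtype=torch.float32)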


3.5 Multi-Branch Deep Learning Architecture

The proposed model is a multi-branch deep neural network, where each branch is dedicated to learning from a specific group of audio features. This architectural choice allows the system to extract rich, domain-specific representations from various aspects of the audio—such as timbre, pitch, melody, harmony, and rhythm—before merging them for final classification.

3.5.1 Architectural Rationale


Indian classical ragas are complex musical entities defined
not only by scale patterns, but also by specific melodic
contours (pakad), rhythm cycles (taal), and note emphasis
(vadi–samvadi). No single feature captures this entirely.
Hence, the model processes:
• Time-frequency features with Convolutional Neural
Networks (CNN)
• Sequential melodic features with Recurrent Neural
Networks (GRU)
• Statistical and structural features with Multi-Layer
Perceptrons (MLP)
These feature embeddings are fused into a final feature
vector, which is passed through fully connected layers to
predict the raga.


Feature Type            Feature Name             Model Type Used         Why?
Time-Frequency Map      Mel-Spectrogram          Multilayer CNN          Learns spatial (time-frequency) patterns
Time-Series Sequence    MFCC (+Delta)            CNN + GRU               Captures temporal evolution of timbre
Time-Series Sequence    Pitch Contour (Cents)    Multilayer GRU          Captures sequential melodic flow
Vector (Statistical)    Pitch Histogram          MLP (Fully Connected)   Encodes note usage distribution
Vector (Statistical)    Chroma, Tonnetz          MLP                     Encodes harmonic structure
Vector (Statistical)    Rhythm (Tempogram)       MLP                     Represents beat and rhythmic cycle
Vector (Categorical)    Vadi–Samvadi + Tonic     MLP                     Categorical and scalar musical metadata


3.5.2 Branch 1: CNN on Mel-Spectrogram


What it does:
This branch analyzes the mel-spectrogram, which is a heatmap-
like image that shows how frequencies evolve over time. This helps
the model learn instrument timbre, texture, and raga-specific
spectral patterns.
Purpose:
Captures local spectral patterns and instrument-specific
timbre from 2D time–frequency representations.
Input:
A matrix of shape mel_bins × time_frames (like a 128×216
image)
Mel-spectrogram matrix X ∈ R M × T where:
• M : number of mel bins (e.g., 128)
• T : time frames

Layer-wise Operations:
(1) 2D Convolution Layer
Applies a set of filters (small windows) that slide over the mel-
spectrogram and extract patterns.
Equation:

h^(1)_ij = σ( Σ W · X + b )

Where:
• X = input mel-spectrogram
• W = learned filters
• b = bias
• σ = activation (like ReLU)


More precisely, it applies a set of filters W ∈ ℝ^(k_m × k_t) over the mel-spectrogram:

h^(1)_ij = σ( Σ_{m=0}^{k_m−1} Σ_{t=0}^{k_t−1} W_mt · X_{i+m, j+t} + b )

Where:
• σ : activation function (ReLU)
• b : bias

(2) Batch Normalization + ReLU

• Normalizes output for better training


• ReLU adds non-linearity

(3) Max Pooling


• Downsamples the output by picking the maximum in small regions:

h^(2)_ij = max_{(m,t) ∈ pool} h^(1)_{i+m, j+t}

This reduces dimensionality while retaining important features.

(4) Dropout (Regularization)


Randomly zeroes out neurons during training to prevent overfitting.

(5) Flatten → Fully Connected Layer


• Flattens the feature maps and transforms them into a 1D vector (feature embedding), projecting to a latent vector z_mel ∈ ℝ^d.

Output:
• A vector z_mel that encodes the spectral features of the input audio.
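A hedged PyTorch sketch of this branch is given below; the number of filters, kernel sizes, and embedding dimension are illustrative assumptions rather than the report's exact hyperparameters.

import torch
import torch.nn as nn

class MelCNNBranch(nn.Module):
    # Sketch of the mel-spectrogram branch: Conv -> BN -> ReLU -> MaxPool -> Dropout -> FC
    def __init__(self, n_mels=128, n_frames=216, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
        )
        self.fc = nn.Linear(32 * (n_mels // 4) * (n_frames // 4), embed_dim)

    def forward(self, x):            # x: (batch, 1, n_mels, n_frames)
        z = self.conv(x)
        return self.fc(z.flatten(start_dim=1))   # z_mel: (batch, embed_dim)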


3.5.3 Branch 2: CNN + GRU on MFCC


What it does:
Extracts both short-term patterns (with CNN) and long-term
sequential dependencies (with GRU) from MFCCs and their
derivatives.
Purpose:
Encodes timbre, temporal envelope, and sequential variations
using both CNN and RNN.
Input:
• MFCC matrix of shape num_features × time_frames (e.g., 26 × 216), i.e., an MFCC-with-delta matrix X ∈ ℝ^(F×T)
Layer-wise Operations:
(1) 1D Convolution over the time axis
• Detects local variations in features across time.
Equation:

h^(1)_t = σ( Σ_{f=0}^{k−1} W_f · X_{f,t} + b )

(2) GRU – Gated Recurrent Unit Layer (RNN)
A GRU is a type of RNN that processes sequences and remembers important information. It models sequential dependency across time.
GRU Update Equations:
• Update gate (how much to keep from the past):
  z_t = σ( W_z x_t + U_z h_{t−1} )
• Reset gate (how much to forget):
  r_t = σ( W_r x_t + U_r h_{t−1} )
• Candidate state:
  h̃_t = tanh( W_h x_t + U_h ( r_t ⊙ h_{t−1} ) )
• Final state:
  h_t = ( 1 − z_t ) ⊙ h_{t−1} + z_t ⊙ h̃_t
Where:
• x_t : input at time t
• h_t : hidden state
• z_t : update gate
• r_t : reset gate

(3) Final Hidden State → Dense Layer


The final hidden state is projected to an output embedding vector z_mfcc ∈ ℝ^d, which represents the timbral-temporal encoding.
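A corresponding PyTorch sketch of this branch, with channel counts and hidden sizes chosen purely for illustration:

import torch
import torch.nn as nn

class MfccCnnGruBranch(nn.Module):
    # Sketch of the MFCC branch: 1D conv over time, then a GRU, then a dense projection
    def __init__(self, n_features=26, embed_dim=128, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed_dim)

    def forward(self, x):            # x: (batch, n_features, n_frames)
        h = self.conv(x)             # (batch, 64, n_frames)
        h = h.transpose(1, 2)        # GRU expects (batch, time, channels)
        _, h_last = self.gru(h)      # h_last: (1, batch, hidden)
        return self.fc(h_last.squeeze(0))   # z_mfcc: (batch, embed_dim)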


3.5.4 Branch 3: GRU on Pitch Contour


What it does:
Processes the pitch contour (melody line) over time to learn the
shape and grammar of the raga.
Purpose:
Captures the melodic progression of the raga (Arohana–
Avarohana), swara transitions, and ornamentations.
Input:
• Pitch contour in cents (normalized with respect to the tonic), a 1D time-series vector: x = [ x_1, x_2, ..., x_T ]
Layer-wise Operations:
(1) GRU Layer
• Learns the sequence of pitch movements (Arohana–Avarohana,
Pakad)
Same GRU update rules as above, producing an output sequence [ h_1, h_2, ..., h_T ].
(2) Last Hidden State or Average Pooling of hidden states:
• Takes either the last GRU output or the average of all hidden states:

z_pitch = (1/T) Σ_{t=1}^{T} h_t
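A PyTorch sketch of this branch using mean pooling over the GRU outputs; hidden and embedding sizes are illustrative assumptions.

import torch
import torch.nn as nn

class PitchGruBranch(nn.Module):
    # Sketch of the pitch-contour branch: GRU over the cents sequence, then mean-pooled states
    def __init__(self, embed_dim=64, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed_dim)

    def forward(self, x):            # x: (batch, T) pitch contour in cents
        out, _ = self.gru(x.unsqueeze(-1))   # (batch, T, hidden)
        z_pitch = out.mean(dim=1)            # average pooling of hidden states
        return self.fc(z_pitch)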


3.5.5 Branch 4: MLP on Statistical Features


What it does:
Processes non-sequential features such as:
• Pitch histogram (swara distribution)
• Chroma (note classes)
• Tonnetz (harmonic distance)
• Rhythm profile (taal)
• Vadi–Samvadi
• Tonic (Hz)
Purpose:
Models swara importance, rhythm profile, and other high-level
statistics.
Input:
• A 1D feature vector combining all of the above, i.e., the concatenated feature vector:

x = [ pitch_histogram, chroma, tonnetz, rhythm, vadi–samvadi, tonic ]

Layers:
(1) Dense Layer:

h^(1) = σ( W_1 x + b_1 )

(2) Dropout + ReLU
(3) Dense Layer:

z_mlp = σ( W_2 h^(1) + b_2 )

Output:
• A learned vector that summarizes the structural musical features.
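A PyTorch sketch of this branch; layer widths and the dropout rate are illustrative assumptions.

import torch
import torch.nn as nn

class StatsMlpBranch(nn.Module):
    # Sketch of the statistical branch: two dense layers with dropout on the flat feature vector
    def __init__(self, in_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, embed_dim), nn.ReLU(),
        )

    def forward(self, x):            # x: (batch, in_dim) concatenated statistics
        return self.net(x)           # z_mlp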


3.5.6 Feature Fusion Layer


What it does:
Concatenates the outputs from all branches:

z_fused = [ z_mel, z_mfcc, z_pitch, z_mlp ]

This combined vector has comprehensive knowledge of the spectral, melodic, temporal, and statistical structure of the raga.

3.5.7 Fully Connected Layers (Classifier)


(1) Dense Layer:

h = σ( W_fc · z_fused + b_fc )

(2) Dropout
(3) Output Layer:

ŷ = softmax( W_out · h + b_out )

Where:
• ŷ is the predicted probability for each raga class

3.5.8 Loss Function (Categorical Cross-Entropy)


What it does:
Measures how far the predicted class is from the true raga label.
Equation:

L_CE = − Σ_{i=1}^{C} y_i · log( ŷ_i )

Where:
• C : number of raga classes
• y_i : ground truth (one-hot)
• ŷ_i : predicted probability
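Putting the last three subsections together, a hedged sketch of the fusion head and the training objective is shown below. The five output classes mirror the five raga folders listed in Section 3.2; layer widths are assumptions, and nn.CrossEntropyLoss applies the softmax internally, so the head returns raw logits.

import torch
import torch.nn as nn

class RagaClassifier(nn.Module):
    # Sketch of the fusion head: concatenate branch embeddings and classify
    def __init__(self, fused_dim, n_ragas=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_ragas),   # raw logits; softmax is applied inside the loss
        )

    def forward(self, z_mel, z_mfcc, z_pitch, z_mlp):
        z_fused = torch.cat([z_mel, z_mfcc, z_pitch, z_mlp], dim=1)
        return self.head(z_fused)

# Training step with categorical cross-entropy and Adam (as listed in Section 3.1):
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = criterion(model(z_mel, z_mfcc, z_pitch, z_mlp), raga_labels)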

Summary

Branch                 Learns from…                 Learns what…
CNN on Mel             Spectrogram                  Instrument sound, texture
CNN + GRU on MFCC      Cepstral + sequential info   Timbre + temporal change
GRU on Pitch           Pitch contour                Raga melody pattern, pakad
MLP on Stats           Global features              Swara use, harmony, rhythm, tonic
Fusion → Classifier    All of the above             Predicts the most likely raga

Why This Architecture Works for Raga Detection

Component          What It Learns                              Importance for Raga
CNN on Mel         Instrumental timbre & spectral patterns     Timbre-rich clues
CNN+GRU on MFCC    Envelope changes and sequential tone shifts Artist/instrument styles
GRU on Pitch       Melodic grammar & phrase curves             Pakad & Arohana
MLP on Stats       Taal, tonic, swara emphasis                 Global structure

Advantages of This Architecture

Feature                    Benefit
Multi-Branch Design        Specializes for different types of input representations
CNNs for Spectrograms      Extract localized spectral features robust to noise
GRUs for Melody            Capture raga flow and phrase transitions (Pakad)
MLP for Statistics         Simple and efficient modeling of flat distributions
Tonic-Normalized Modeling  Handles pitch shifts across artists and instruments

• RESPONSE GENERATION

The system generates a response based on the analyzed query and provides
the answer in text format. Optionally, the response can also be delivered in
spoken form using text-to-speech technology.

3.7 Deleting Current Chat Session and Clearing Cache

Our system includes features for deleting the current chat session and clearing the cache. These functionalities allow users to manage their interactions and ensure a fresh start whenever needed.

Functional overview
• DELETING CURRENT CHAT SESSION

Users can delete their current chat session through the user interface. This removes the ongoing conversation, allowing users to start anew without previous context.

Figure 3.21: Deleting current chat session and clearing cache

• CLEARING CACHE

Users have the option to clear the system's cache. This removes all temporary data, ensuring that the system starts fresh with no residual information from past interactions.

4 Conclusion
The results of this QA system using the NLP project were very promising and well produced, as we intended. This approach involves processing a large number of questions simultaneously, with a large amount of data given as input to the question answering system. It involves natural language processing performed based on linguistic analysis. The question is given in natural language through the input device, where the system treats it as a query and returns the best suitable statement or sentence as the output.

The question answering system produces only the requested information instead of searching the whole database or whole documents as a search engine does, which is generally a time-consuming process. In our daily life, information is increasing rapidly, so extracting even the required piece of information requires many resources.

Bibliography

[1] A. Allam and M. Haggag, The question answering systems: A survey, International Journal of Research and Reviews in Information Sciences, 2 (2012), pp. 211–221.

[2] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, Mistral 7B, 2023.

[3] H. Liu, C. Li, Y. Li, and Y. J. Lee, Improved baselines with visual instruction tuning, 2024.

[4] H. Liu, C. Li, Q. Wu, and Y. J. Lee, Visual instruction tuning, 2023.

[5] B. Ojokoh and E. Adebisi, A review of question answering systems, Journal of Web Engineering, 17 (2018), pp. 717–758.

[6] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, Robust speech recognition via large-scale weak supervision, 2022.

[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language models are unsupervised multitask learners, 2019.

[8] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023.
