
Multisense GPT, a Question Answering System
A DISSERTATION

submitted in partial fulfillment of the


requirements for the award of the degree of

Master of Science
in
Computer Science

by

Chaiti Paul
Student roll: 22332305

Deep Das
Student roll: 22332306

Under the supervision of

Prof. Debjani Bhattacharjee

DEPARTMENT OF COMPUTER SCIENCE


ACHARYA PRAFULLA CHANDRA COLLEGE
WEST BENGAL STATE UNIVERSITY
July 2024
DECLARATION

We, Chaiti Paul and Deep Das, declare that the work entitled Multisense GPT, a Question Answering System presented in this thesis is original and has been done by ourselves under the supervision of Prof. Debjani Bhattacharjee. The work has not been submitted for a degree elsewhere, in part or in full. Further, due credit has been attributed to the relevant state-of-the-art collaborations with appropriate citations and acknowledgments, in line with established norms and practices.

NAME: CHAITI PAUL
ROLL NO: 22342405
SIGNATURE:

NAME: DEEP DAS
ROLL NO: 22342406
SIGNATURE:
CERTIFICATE

This is to certify that the work entitled Multisense GPT, a Question Answering System, submitted by Chaiti Paul and Deep Das, has been carried out under my supervision for the partial fulfillment of the requirements for the award of the Master of Science in Computer Science.

It is understood that by this approval the undersigned do not necessarily endorse any of the statements made or opinions expressed therein but approve it only for the purpose for which it is submitted.

SIGNATURE OF SUPERVISOR                SIGNATURE OF THE EXTERNAL EXAMINER/S

PROF. DEBJANI BHATTACHARJEE
ACKNOWLEDGMENTS

First, I express my gratitude to the Almighty, who blessed me with the zeal and enthusiasm to complete this research work successfully. I am extremely thankful to my supervisor Prof. Debjani Bhattacharjee, Department of Computer Science, Acharya Prafulla Chandra College, New Barrackpore, Kolkata, for their motivation and tireless efforts to help me gain deep knowledge of the research area and for supporting me throughout the life cycle of my M.Sc. dissertation work. Especially, the extensive comments, healthy discussions, and fruitful interactions with the supervisor directly impacted the final form and quality of my M.Sc. dissertation work.

I also thank Prof. Kunal Das, Head of the Computer Science Department, for their fruitful guidance through the early years of chaos and confusion. I wish to thank the faculty members and supporting staff of the Computer Science Department for their full support and heartiest cooperation.

This thesis would not have been possible without the hearty support of my friends. My deepest regards to my Parents for their blessings, affection, and continuous support. Last but not least, I thank GOD, the Almighty, for giving me the inner willingness, strength, and wisdom to carry out this research work successfully.
ABSTRACT

A Question Answering System (QAS) is an information retrieval system in which a direct answer is expected in response to a submitted query, rather than a set of references that may contain the answers. The QAS is concerned with providing relevant answers in response to questions posed in natural language. This paper focuses on the design and implementation of an open-domain, interactive QA system. The proposed system is explained and demonstrated with detailed examples. The language processing portion of the system uses several Large Language Models (LLMs) to judge the meaningfulness of questions, generating dialogue to clarify the response. The system is specifically designed to effectively handle and provide responses to inquiries that encompass textual, visual, and auditory content.

Keywords: Natural Language Processing (NLP), Question-Answering System (QAS), Large Language Model (LLM), Question Processing, Answer Extraction
Table of Contents

1 Introduction 1

2 Literature Review 3
2.1 LLaVA...................................................................................................................................3
2.2 ggml-model-q5-k..............................................................................................................3
2.3 mmproj-model-f16...........................................................................................................3
2.4 Mistral-7b-instruct-v0.1.Q5-K-M.......................................................................................3
2.5 Whisper AI...........................................................................................................................4
2.6 GPT-2....................................................................................................................................4
2.7 Llama-2-13B-chat-GGML...................................................................................................4

3 Methodology 5

Working model 5

Features 7
3.1 Knowledge Base-Driven Responses Without Input.....................................................7
3.2 PDF Analysis and Question Answering...........................................................................8
3.3 Photo Analysis and Question Answering.....................................................................10
3.4 Video Analysis and Question Answering.....................................................................13
3.5 Session Management....................................................................................................18
3.6 Voice Commands...........................................................................................................19
3.7 Deleting Current Chat Session and Clearing Cache.................................................20

4 Conclusion 21

Bibliography 22

1 Introduction
The proliferation of internet usage in conjunction with the remarkable expansion of data storage capacity has granted us the capability to both securely archive and effectively disseminate data to the public. However, sifting through this vast amount of data has made finding information time-consuming and complex. As a result, there has been a push to develop new research tools, such as Question Answering Systems, to address this challenge. Question answering systems represent a noteworthy advancement in information retrieval technologies, especially in their ability to access knowledge resources naturally by querying and retrieving the right replies in succinct words.

Since the 1960s, numerous QA Systems (QAS) have been developed to respond to user inquiries through various approaches. These systems have addressed a wide range of domains, databases, question types, and answer structures. Modern methods involve retrieving and analyzing data from multiple sources to effectively respond to questions presented in natural language. A Question Answering (QA)[1] system is a multidisciplinary research area that encompasses Information Retrieval (IR), Information Extraction (IE), and Natural Language Processing (NLP). The objective of a question-answering system is to provide answers to queries rather than overwhelming users with full documents or the most pertinent passages, which is the typical functionality of most information retrieval systems. For example, for the question "Who is the first prime minister of India?", the exact answer expected by the user is "Pandit Jawaharlal Nehru"; the user does not intend to read through passages or documents that merely match words like first, prime minister, India, etc.

The question-answering system[5] is continually advancing and improving to meet the challenges and opportunities in this field through the integration of new trends and innovations. From the initial text-based programs to the current AI-powered question-answering systems, virtual assistants and chatbots have emerged as sophisticated communication tools, offering automation and customer support capabilities. Numerous businesses utilize these systems on their websites and social media accounts to offer round-the-clock customer support without the need for a large team of human agents. Advanced chatbots are also employed in the healthcare, finance, and education sectors to deliver personalized assistance and support. Additionally, some chatbots can collect user data, which can be used to improve marketing campaigns and personalize the customer experience. One of the renowned chatbots in the current market, ChatGPT, is a sophisticated language model designed to enhance the engagement and informativeness of chatbots. Trained on an extensive dataset comprising text and code, ChatGPT possesses the capacity to comprehend and produce human-like text. However, despite their advanced capabilities, these systems are not without limitations. When used as a QAS without any text for answer extraction, systems like ChatGPT must rely on their own knowledge to generate answers, which can be strongly influenced by training bias. Also, these systems cannot interact with audio/video data or PDFs.

We attempted to integrate these types of data into our system. Our system can process audio, PDFs, and images to provide human-like responses to user queries. Without context to generate responses from, it uses its knowledge base to provide an answer. We have also incorporated a voice feature into our system to offer a dependable and user-friendly environment. Furthermore, we store users' previous conversation sessions along with timestamps in the system database, enabling users to access and review them anytime.

2 Literature Review
We selected the models based on their state-of-the-art performance in various NLP tasks and their availability through trusted providers. The models were chosen to meet our performance requirements, handle our data efficiently, and integrate seamlessly into our system architecture. The models that we have utilized to implement our system are LLaVA, ggml-model-q5-k, mmproj-model-f16, Mistral-7b-instruct-v0.1.Q5-K-M, Whisper AI, GPT-2, and Llama-2-13B-chat-GGML. A few studies on these models are summarized below.

2.1 LLaVA
LLaVA (Large Language and Vision Assistant)[4] is an open-source chatbot, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, achieving impressive chat capabilities. It is designed to understand and generate content based on visual inputs (images) and textual instructions.

2.2 ggml-model-q5-k
The ggml-model-q5-k is utilized to produce responses for queries related to PDFs. It is a file
structure in llava-v1.5-13b[3].

2.3 mmproj-model-f16
The mmproj-model-f16 is utilized to produce responses for queries related to images. It
is a file structure in llava-v1.5-7b.

2.4 Mistral-7b-instruct-v0.1.Q5-K-M
The Mistral-7B-Instruct-v0.1[2] Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.1 generative text model using a variety of publicly available conversation datasets.


2.5 Whisper AI
Whisper[6] is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. Whisper is a transformer-based encoder-decoder model. It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio. Whisper performs the task of speech transcription, where the source audio language is the same as the target text language.

2.6 GPT-2
The GPT-2[7] model was introduced in the paper "Language Models are Unsupervised Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever from OpenAI. It is a causal (unidirectional) transformer that was pretrained using language modeling on a large dataset of approximately 40 GB of text. GPT-2 is a large transformer-based language model with 1.5 billion parameters, and it was trained on a dataset consisting of 8 million web pages.

2.7 Llama-2-13B-chat-GGML
Llama 2[8] is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pre-trained and fine-tuned variations.

Llama 2-Chat is a family of fine-tuned Llama-2 models that are optimized for dialogue use cases. These models are specifically designed to generate human-like responses to natural language input, making them suitable for chatbot and conversational AI applications.

Llama-2-13B-chat is a 13 billion parameter model and is pre-trained on a large corpus of text that includes conversational data, such as chat logs and social media posts. This allows the model to learn the patterns and structures of natural language dialogue and to generate coherent and contextually appropriate responses to user input.

3 Methodology
3.1 System Overview
The proposed system for Indian Classical Instrumental
Raga Detection is a comprehensive, data-driven
architecture that combines advanced audio signal
processing and deep learning techniques. Its design is
inspired by the rich, multidimensional nature of Indian
classical music, which incorporates melody, rhythm, pitch
structures, and modal scales. The system aims to identify
the raga of a given instrumental audio clip with high
accuracy by analyzing both low-level and high-level
musical features.

Key Objectives of the System:

• To process raw audio input and extract meaningful musical features such as melodic contour, rhythmic cycles (taal), and tonal centers (tonic).
• To build a robust model that can learn from these features and classify the raga, even in instrumental contexts where vocal clues are absent.
• To support visualization and interpretability of the raga prediction process.


Major Components of the System:

1. Audio Preprocessing and Feature Extraction
The raw audio is first loaded and standardized in terms of sampling rate and duration. Then, a rich set of musical features is extracted, representing different dimensions of a raga:
• Spectral features: Mel-spectrogram, MFCC
• Melodic features: Pitch contour, pitch histogram
• Harmonic features: Chroma, Tonnetz
• Rhythmic features: Tempogram, onset envelope (Taal structure)
• Structural cues: Vadi–Samvadi, tonic estimation
This stage ensures that all essential musical aspects of the raga are captured before model training.

2. Feature Normalization and Alignment
Since the extracted features vary in length, scale, and dimensionality, each feature is normalized:
• Pitch contours are interpolated to fixed-length vectors.
• Energy-based features are scaled using min-max or z-score normalization.
• Feature tensors are then prepared for input into a deep learning model.
This step guarantees consistency and comparability across different input samples.

3. Multi-Branch Deep Learning Architecture
Given the heterogeneity of the input features, the system employs a multi-branch neural network architecture:
• Each feature (e.g., mel-spectrogram, pitch contour, MFCC) is passed through a dedicated sub-network (CNN, GRU, or dense layer).
• These branches learn to encode specific information:
  o CNNs extract local patterns from time-frequency maps
  o GRUs capture sequential dependencies in pitch/melody
  o Dense layers compress statistical features
Each branch outputs a latent representation or embedding of its input feature.
4. Feature Fusion and Classification
The feature embeddings from all branches are concatenated into a single comprehensive feature vector. This is passed through one or more fully connected layers with non-linear activation functions and dropout for regularization. The final layer uses a softmax function to output the predicted probability distribution over all possible raga classes.

5. Model Training and Evaluation
The model is trained using labeled audio samples and supervised learning:
• Loss Function: Cross-entropy loss
• Optimizer: Adam
• Metrics: Accuracy, precision, recall, F1-score, and confusion matrix
• Training Techniques: Data augmentation (e.g., pitch shifting, time stretching) is used to prevent overfitting and improve generalization.

6. Raga Prediction and Visualization
At inference time, the system:
• Accepts a new audio clip
• Extracts and processes features in the same way as during training
• Uses the trained model to predict the raga
• Visualizes:
  o Mel-spectrogram
  o Waveform
  o Predicted raga

System Strengths
• Tonic-invariant: Uses tonic-normalized pitch features.
• Multi-dimensional: Considers melody, rhythm, and harmony simultaneously.
• Modular: Easily extendable for new features or musical styles.
• Instrument-agnostic: Designed specifically for instrumental music, which lacks the syllabic markers present in vocals.


3.2 Data Collection and Preprocessing


An essential step in building any machine learning
system is the careful preparation of the dataset. In the
context of Indian classical music, particularly
instrumental raga detection, it becomes even more
critical due to the diversity in performance styles,
instruments, recording qualities, and tonal references.
This section outlines the process followed for collecting,
organizing, and preprocessing the audio data used in our
model.

3.2.1 Data Collection

The dataset used in this project consists of instrumental audio recordings of Indian classical music, with each recording labelled according to its corresponding raga. The data was curated from a variety of open-access sources, digital music archives, and personal collections of classical performances. The selection focused solely on non-vocal instrumental compositions, ensuring the absence of lyrics or vocal ornamentations that could otherwise influence pitch detection and feature extraction.

Sources of Audio Data (limited dataset)

• Kaggle
• YouTube

Each audio clip is associated with metadata including:

• Raga name (label)
• Duration

Data Organization
The audio files were organized in a folder hierarchy such as:
Data/
  raga bhairav/      (.wav files)
  raga bhoop/        (.wav files)
  raga madhuvanti/   (.wav files)
  raga shudh sarang/ (.wav files)
  raga yaman/        (.wav files)

3.2.2 Preprocessing Pipeline


To ensure consistency and compatibility with the model's
input requirements, the following preprocessing steps were
applied to all audio files:

1. Resampling
All audio signals were resampled to a standard sampling
rate of 22,050 Hz to match the input requirements of
librosa and reduce computational overhead. This rate is
sufficient for capturing essential harmonic and melodic
content up to 11 kHz, which includes all necessary frequency
information for Indian classical instruments.

2. Mono Conversion
Recordings with stereo channels were converted to mono by
averaging the two channels. This simplifies analysis and
removes any spatial or panning information which is
irrelevant for raga recognition.

3. Duration Normalization
To maintain uniform input dimensions:
• Audio clips were trimmed to a maximum duration of X seconds
• Shorter clips were zero-padded to match the target duration
This step ensures that extracted features (e.g., mel-spectrograms, pitch contours, etc.) have consistent temporal dimensions across all samples.
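The following is a minimal sketch of this preprocessing step using librosa; the helper name and the concrete 30-second clip length are illustrative assumptions, since the report leaves the target duration as "X seconds".

import librosa
import numpy as np

TARGET_SR = 22050          # resampling rate used in the report
TARGET_SECONDS = 30        # assumed clip length; the report leaves this as "X seconds"

def load_and_standardize(path, sr=TARGET_SR, seconds=TARGET_SECONDS):
    # librosa loads as mono float32 and resamples in one call
    y, _ = librosa.load(path, sr=sr, mono=True)
    target_len = sr * seconds
    if len(y) > target_len:          # trim long clips
        y = y[:target_len]
    else:                            # zero-pad short clips
        y = np.pad(y, (0, target_len - len(y)))
    return y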


3.2.3 Labeling for Future Usage and Dataset Splitting

Each folder name corresponds to a raga class. This directory structure allowed automatic parsing of labels during training.

Train–Validation–Test Split:
Multiple splits were used for better distribution and accuracy, for example:
Example 1:
• Training set: 70%
• Validation set: 15%
• Test set: 15%
Example 2:
• Training set: 80%
• Validation set: 10%
• Test set: 10%

Stratified sampling ensured a balanced raga distribution across splits.
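A stratified 70/15/15 split of this kind can be sketched with scikit-learn as below; the Data/ folder layout follows the hierarchy shown earlier, while the two chained splits and the random seed are illustrative assumptions.

import glob, os
from sklearn.model_selection import train_test_split

# Collect (file, label) pairs; the label is simply the folder name, e.g. "raga yaman"
files = glob.glob("Data/*/*.wav")
labels = [os.path.basename(os.path.dirname(f)) for f in files]

# 70 / 15 / 15 split with stratification so every raga appears in every subset
train_f, temp_f, train_y, temp_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    temp_f, temp_y, test_size=0.50, stratify=temp_y, random_state=42)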

Note :
The preprocessing and data preparation phase ensured that the
raw audio recordings were converted into a clean, consistent,
and augmented format suitable for deep learning. Careful
attention was given to maintain musical integrity, especially in
preserving the tonic reference and melodic content that are
crucial for successful raga detection.


3.3 Feature Extraction


Feature extraction is the core component of this raga
detection system. Indian classical ragas are defined by
melodic movement, tonal structure, rhythmic patterns,
and emphasis on specific notes. Therefore, a rich set of
time-series and statistical features is extracted using the
librosa library and custom signal processing functions.
All features are extracted from a resampled, mono, and amplitude-normalized audio signal.

3.3.1 Tonic Detection

Librosa Function: librosa.piptrack(y, sr)
We extract the tonic (the base frequency corresponding to 'Sa') from the pitch salience spectrogram. The most frequently occurring pitch peak is assumed to be the tonic.

Use in System: The tonic is used to convert absolute pitch to relative pitch in cents, making features tonic-invariant, which is critical for raga modeling.
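One plausible implementation of this rule is sketched below; the histogram resolution and frequency range are assumptions made for illustration, not values taken from the report.

import librosa
import numpy as np

def estimate_tonic(y, sr=22050):
    # pitches[f, t] holds the frequency of bin f at frame t, magnitudes[f, t] its salience
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
    # Keep the most salient pitch per frame, drop unvoiced frames (0 Hz)
    idx = magnitudes.argmax(axis=0)
    frame_pitches = pitches[idx, np.arange(pitches.shape[1])]
    frame_pitches = frame_pitches[frame_pitches > 0]
    # The most frequently occurring pitch is taken as the tonic ('Sa')
    hist, edges = np.histogram(frame_pitches, bins=240, range=(50, 1000))
    tonic_hz = edges[hist.argmax()]
    return tonic_hz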


3.3.2 Mel-Spectrogram

Librosa Function: librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N)
Description: A perceptually scaled spectrogram emphasizing human auditory resolution.
Use in the system: Captures the timbral texture of the instrument and the overall spectral envelope, aiding in distinguishing between raga types and instrument articulations.

3.3.4 MFCC + Delta-MFCC

Librosa Functions: librosa.feature.mfcc(), librosa.feature.delta()
Description: Compact representation of the spectrum using cepstral coefficients. The delta captures the derivative (velocity).
Use in the system: Encodes instrument characteristics and articulation over time, which is important for distinguishing how a raga is rendered on different instruments.
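A hedged sketch of how these two features might be extracted with librosa is shown below; the choice of 128 mel bins, 13 MFCCs, and the dB conversion are illustrative assumptions.

import librosa
import numpy as np

def spectral_features(y, sr=22050, n_mels=128, n_mfcc=13):
    # Mel-spectrogram (converted to dB for a perceptually reasonable scale)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # MFCCs plus their first-order deltas, stacked feature-wise (e.g., 13 + 13 = 26 rows)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    mfcc_delta = np.vstack([mfcc, delta])
    return mel_db, mfcc_delta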

3.3.5 Chroma Features

Librosa Function: librosa.feature.chroma_stft(y, sr)
Description: Maps energy to 12 pitch classes (C, C#, D, ..., B), regardless of octave.
Equation:

Chroma_n = Σ_{f ∈ P_n} |X(f)|²

where P_n is the set of frequencies corresponding to pitch class n.
Use in the system: Represents scale structure (Arohana/Avarohana) and note emphasis, crucial for raga differentiation.

3.3.6 Tonnetz (Tonal Centroids)

Librosa Function: librosa.feature.tonnetz(y=y, sr=sr)
Description: Captures harmonic relations such as perfect fifths, minor thirds, and major thirds.
Use in the system: Helps in identifying harmonic characteristics of ragas, especially when played on harmonic-rich instruments.
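The chroma and Tonnetz features could be computed as in the sketch below; averaging over time to obtain fixed-length vectors is an assumption about how the report turns them into the statistical vectors used later.

import librosa
import numpy as np

def harmonic_features(y, sr=22050):
    # 12-bin chroma from the short-time Fourier transform
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    # Tonal centroids, computed here on the harmonic component of the signal
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    # Summarize each as a fixed-length vector by averaging over time
    return chroma.mean(axis=1), tonnetz.mean(axis=1)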

3.3.7 Pitch Histogram

Librosa Function: None (custom histogram on the pitch contour, in cents)
Description: Represents the distribution of note occurrences relative to the tonic.
Use in the system: Detects swara usage patterns and supports Vadi–Samvadi detection.


3.3.8 Rhythm Profile (Tempogram)

Librosa Functions:
• librosa.onset.onset_strength(y, sr)
• librosa.feature.tempogram(onset_envelope)

Use in the system: Models Taal structure and rhythm, which differentiate ragas sharing melodic patterns.
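A minimal sketch of the rhythm profile, under the assumption that the tempogram is averaged over time into a fixed-length vector:

import librosa

def rhythm_profile(y, sr=22050):
    # Onset strength envelope captures where note attacks occur
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    # Tempogram: local autocorrelation of the onset envelope over tempo lags
    tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)
    # Average over time to obtain a fixed-length rhythm profile vector
    return tempogram.mean(axis=1)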

3.3.9 Vadi–Samvadi Estimation

Librosa Function: None (custom logic from the pitch histogram)
Logic:
• Vadi and Samvadi are inferred as the two most prominent bins in the pitch histogram.
• Sorted using np.argsort(histogram)[-2:].
Use: Encodes the central emotional and structural swaras of the raga.
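A possible implementation of the pitch histogram and the Vadi–Samvadi rule is sketched below; folding pitches into a single octave and the 120-bin resolution are assumptions for illustration.

import numpy as np

def pitch_histogram_and_vadi(pitch_hz, tonic_hz, n_bins=120):
    # Convert absolute pitch to cents relative to the tonic, folded into one octave
    cents = 1200 * np.log2(pitch_hz[pitch_hz > 0] / tonic_hz)
    cents = np.mod(cents, 1200)
    # Histogram of note occurrences relative to 'Sa'
    hist, _ = np.histogram(cents, bins=n_bins, range=(0, 1200), density=True)
    # Vadi and Samvadi taken as the two most prominent histogram bins
    vadi_bin, samvadi_bin = np.argsort(hist)[-1], np.argsort(hist)[-2]
    return hist, vadi_bin, samvadi_bin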

3.3.10 Tonic (Hz)


Used for pitch normalization.
Extracted with pitch histogram from piptrack().


3.4 Feature Normalization and Resampling

To ensure uniformity across input samples, features are resampled and normalized.

3.4.1 Interpolation of Pitch Contour

Method: Linear interpolation using scipy.interpolate.interp1d
• Each pitch contour is interpolated to a fixed length (e.g., 216) regardless of actual duration.
• This allows the model to learn on a fixed input size.
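A minimal sketch of this interpolation step, assuming the fixed length of 216 mentioned above:

import numpy as np
from scipy.interpolate import interp1d

def resample_contour(contour, target_len=216):
    # Map the original time axis onto [0, 1] and linearly interpolate to the fixed length
    x_old = np.linspace(0.0, 1.0, num=len(contour))
    x_new = np.linspace(0.0, 1.0, num=target_len)
    return interp1d(x_old, contour, kind="linear")(x_new)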

3.4.2 Normalization Techniques

Min-Max Normalization (e.g., for pitch histogram, rhythm profile):

x' = (x − x_min) / (x_max − x_min)

Z-Score Normalization (e.g., for MFCC, mel-spectrogram):

z = (x − μ) / σ

Purpose:
• Prevents features with large values (e.g., energy-based) from dominating others.
• Accelerates training convergence in neural networks.
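These two normalizations can be written directly as small helpers; the epsilon guard against constant features is an added assumption.

import numpy as np

def min_max_normalize(x, eps=1e-8):
    # Rescales values into [0, 1]; eps guards against constant features
    return (x - x.min()) / (x.max() - x.min() + eps)

def z_score_normalize(x, eps=1e-8):
    # Zero mean, unit variance
    return (x - x.mean()) / (x.std() + eps)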


3.4.3 Padding and Truncation

• Mel-spectrograms and MFCCs are truncated or zero-padded to a fixed number of frames.
• This makes input dimensions consistent across all samples.

3.4.4 Tensor Conversion

All processed features are converted to torch.Tensor objects with a standardized dtype=torch.float32, making them ready for training.
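A combined sketch of padding/truncation and tensor conversion, assuming features are stored as (n_features, n_frames) NumPy arrays and a fixed frame count of 216:

import numpy as np
import torch

def pad_or_truncate(feat, max_frames=216):
    # feat has shape (n_features, n_frames); fix the time axis to max_frames
    n_frames = feat.shape[1]
    if n_frames >= max_frames:
        feat = feat[:, :max_frames]
    else:
        feat = np.pad(feat, ((0, 0), (0, max_frames - n_frames)))
    return torch.tensor(feat, dtype=torch.float32)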


3.5 Multi-Branch Deep Learning Architecture

The proposed model is a multi-branch deep neural network, where each branch is dedicated to learning from a specific group of audio features. This architectural choice allows the system to extract rich, domain-specific representations from various aspects of the audio—such as timbre, pitch, melody, harmony, and rhythm—before merging them for final classification.

3.5.1 Architectural Rationale


Indian classical ragas are complex musical entities defined
not only by scale patterns, but also by specific melodic
contours (pakad), rhythm cycles (taal), and note emphasis
(vadi–samvadi). No single feature captures this entirely.
Hence, the model processes:
• Time-frequency features with Convolutional Neural
Networks (CNN)
• Sequential melodic features with Recurrent Neural
Networks (GRU)
• Statistical and structural features with Multi-Layer
Perceptrons (MLP)
These feature embeddings are fused into a final feature
vector, which is passed through fully connected layers to
predict the raga.


Feature Type            Feature Name             Model Type Used         Why?
Time-Frequency Map      Mel-Spectrogram          Multilayer CNN          Learns spatial (time-frequency) patterns
Time-Series Sequence    MFCC (+Delta)            CNN + GRU               Captures temporal evolution of timbre
Time-Series Sequence    Pitch Contour (Cents)    Multilayer GRU          Captures sequential melodic flow
Vector (Statistical)    Pitch Histogram          MLP (Fully Connected)   Encodes note usage distribution
Vector (Statistical)    Chroma, Tonnetz          MLP                     Encodes harmonic structure
Vector (Statistical)    Rhythm (Tempogram)       MLP                     Represents beat and rhythmic cycle
Vector (Categorical)    Vadi–Samvadi + Tonic     MLP                     Categorical and scalar musical metadata


3.5.2 Branch 1: CNN on Mel-Spectrogram


What it does:
This branch analyzes the mel-spectrogram, which is a heatmap-
like image that shows how frequencies evolve over time. This helps
the model learn instrument timbre, texture, and raga-specific
spectral patterns.
Purpose:
Captures local spectral patterns and instrument-specific
timbre from 2D time–frequency representations.
Input:
A matrix of shape mel_bins × time_frames (like a 128×216
image)
Mel-spectrogram matrix X ∈ R M × T where:
• M : number of mel bins (e.g., 128)
• T : time frames

Layer-wise Operations:
(1) 2D Convolution Layer
Applies a set of filters (small windows) that slide over the mel-
spectrogram and extract patterns.
Equation:

h^(1)_ij = σ( Σ W · X + b )

Where:
• X = input mel-spectrogram
• W = learned filters
• b = bias
• σ = activation (like ReLU)


More precisely, it applies a set of filters W ∈ ℝ^(k_m × k_t) over the mel-spectrogram:

h^(1)_ij = σ( Σ_{m=0}^{k_m−1} Σ_{t=0}^{k_t−1} W_mt · X_{i+m, j+t} + b )

Where:
• σ : activation function (ReLU)
• b : bias

(2) Batch Normalization + ReLU

• Normalizes output for better training


• ReLU adds non-linearity

(3) Max Pooling


• Downsamples the output by picking the maximum in small regions:

h^(2)_ij = max_{(m,t) ∈ pool} h^(1)_{i+m, j+t}

This reduces dimensionality while retaining important features.

(4) Dropout (Regularization)


Randomly zeroes out neurons during training to prevent overfitting.

(5) Flatten → Fully Connected Layer


• Flattens the feature maps and transforms them into a 1D vector (feature embedding), projecting to a latent vector z_mel ∈ ℝ^d.

Output:
• A vector z_mel that encodes the spectral features of the input audio.
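A hedged PyTorch sketch of this branch is given below; the number of filters, kernel sizes, and embedding dimension are illustrative assumptions rather than the report's exact hyperparameters.

import torch
import torch.nn as nn

class MelCNNBranch(nn.Module):
    # Sketch of the mel-spectrogram branch: Conv -> BN -> ReLU -> MaxPool -> Dropout -> FC
    def __init__(self, n_mels=128, n_frames=216, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
        )
        self.fc = nn.Linear(32 * (n_mels // 4) * (n_frames // 4), embed_dim)

    def forward(self, x):            # x: (batch, 1, n_mels, n_frames)
        z = self.conv(x)
        return self.fc(z.flatten(start_dim=1))   # z_mel: (batch, embed_dim)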


3.5.3 Branch 2: CNN + GRU on MFCC


What it does:
Extracts both short-term patterns (with CNN) and long-term
sequential dependencies (with GRU) from MFCCs and their
derivatives.
Purpose:
Encodes timbre, temporal envelope, and sequential variations
using both CNN and RNN.
Input:
• MFCC matrix of shape num_features × time_frames (e.g., 26 × 216), i.e., an MFCC-with-delta matrix X ∈ ℝ^(F×T)
Layer-wise Operations:
(1) 1D Convolution over the time axis
• Detects local variations in features across time.
Equation:

h^(1)_t = σ( Σ_{f=0}^{k−1} W_f · X_{f,t} + b )

(2) GRU – Gated Recurrent Unit Layer (RNN)
A GRU is a type of RNN that processes sequences and remembers important information. It models sequential dependency across time.
GRU Update Equations:
• Update gate (how much to keep from the past):
  z_t = σ( W_z x_t + U_z h_{t−1} )
• Reset gate (how much to forget):
  r_t = σ( W_r x_t + U_r h_{t−1} )
• Candidate state:
  h̃_t = tanh( W_h x_t + U_h ( r_t ⊙ h_{t−1} ) )
• Final state:
  h_t = ( 1 − z_t ) ⊙ h_{t−1} + z_t ⊙ h̃_t
Where:
• x_t : input at time t
• h_t : hidden state
• z_t : update gate
• r_t : reset gate

(3) Final Hidden State → Dense Layer


The final hidden state is projected to an output embedding vector z_mfcc ∈ ℝ^d, which represents the timbral-temporal encoding.
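A corresponding PyTorch sketch of this branch, with channel counts and hidden sizes chosen purely for illustration:

import torch
import torch.nn as nn

class MfccCnnGruBranch(nn.Module):
    # Sketch of the MFCC branch: 1D conv over time, then a GRU, then a dense projection
    def __init__(self, n_features=26, embed_dim=128, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed_dim)

    def forward(self, x):            # x: (batch, n_features, n_frames)
        h = self.conv(x)             # (batch, 64, n_frames)
        h = h.transpose(1, 2)        # GRU expects (batch, time, channels)
        _, h_last = self.gru(h)      # h_last: (1, batch, hidden)
        return self.fc(h_last.squeeze(0))   # z_mfcc: (batch, embed_dim)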


3.5.4 Branch 3: GRU on Pitch Contour


What it does:
Processes the pitch contour (melody line) over time to learn the
shape and grammar of the raga.
Purpose:
Captures the melodic progression of the raga (Arohana–
Avarohana), swara transitions, and ornamentations.
Input:
• Pitch contour in cents (normalized with respect to the tonic), a 1D time-series vector: x = [ x_1, x_2, ..., x_T ]
Layer-wise Operations:
(1) GRU Layer
• Learns the sequence of pitch movements (Arohana–Avarohana,
Pakad)
Same GRU update rules as above, producing an output sequence [ h_1, h_2, ..., h_T ].
(2) Last Hidden State or Average Pooling of hidden states:
• Takes either the last GRU output or the average of all hidden states:

z_pitch = (1/T) Σ_{t=1}^{T} h_t
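A PyTorch sketch of this branch using mean pooling over the GRU outputs; hidden and embedding sizes are illustrative assumptions.

import torch
import torch.nn as nn

class PitchGruBranch(nn.Module):
    # Sketch of the pitch-contour branch: GRU over the cents sequence, then mean-pooled states
    def __init__(self, embed_dim=64, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed_dim)

    def forward(self, x):            # x: (batch, T) pitch contour in cents
        out, _ = self.gru(x.unsqueeze(-1))   # (batch, T, hidden)
        z_pitch = out.mean(dim=1)            # average pooling of hidden states
        return self.fc(z_pitch)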


3.5.5 Branch 4: MLP on Statistical Features


What it does:
Processes non-sequential features such as:
• Pitch histogram (swara distribution)
• Chroma (note classes)
• Tonnetz (harmonic distance)
• Rhythm profile (taal)
• Vadi–Samvadi
• Tonic (Hz)
Purpose:
Models swara importance, rhythm profile, and other high-level
statistics.
Input:
• A 1D feature vector combining all of the above, i.e., the concatenated feature vector:

x = [ pitch_histogram, chroma, tonnetz, rhythm, vadi–samvadi, tonic ]

Layers:
(1) Dense Layer:

h^(1) = σ( W_1 x + b_1 )

(2) Dropout + ReLU
(3) Dense Layer:

z_mlp = σ( W_2 h^(1) + b_2 )

Output:
• A learned vector that summarizes the structural musical features.
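A PyTorch sketch of this branch; layer widths and the dropout rate are illustrative assumptions.

import torch
import torch.nn as nn

class StatsMlpBranch(nn.Module):
    # Sketch of the statistical branch: two dense layers with dropout on the flat feature vector
    def __init__(self, in_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, embed_dim), nn.ReLU(),
        )

    def forward(self, x):            # x: (batch, in_dim) concatenated statistics
        return self.net(x)           # z_mlp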


3.5.6 Feature Fusion Layer


What it does:
Concatenates the outputs from all branches:

z_fused = [ z_mel, z_mfcc, z_pitch, z_mlp ]

This combined vector has comprehensive knowledge of the spectral, melodic, temporal, and statistical structure of the raga.

3.5.7 Fully Connected Layers (Classifier)


(1) Dense Layer:

h = σ( W_fc · z_fused + b_fc )

(2) Dropout
(3) Output Layer:

ŷ = softmax( W_out · h + b_out )

Where:
• ŷ is the predicted probability for each raga class

3.5.8 Loss Function (Categorical Cross-Entropy)


What it does:
Measures how far the predicted class is from the true raga label.
Equation:

L_CE = − Σ_{i=1}^{C} y_i · log( ŷ_i )

Where:
• C : number of raga classes
• y_i : ground truth (one-hot)
• ŷ_i : predicted probability
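Putting the last three subsections together, a hedged sketch of the fusion head and the training objective is shown below. The five output classes mirror the five raga folders listed in Section 3.2; layer widths are assumptions, and nn.CrossEntropyLoss applies the softmax internally, so the head returns raw logits.

import torch
import torch.nn as nn

class RagaClassifier(nn.Module):
    # Sketch of the fusion head: concatenate branch embeddings and classify
    def __init__(self, fused_dim, n_ragas=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_ragas),   # raw logits; softmax is applied inside the loss
        )

    def forward(self, z_mel, z_mfcc, z_pitch, z_mlp):
        z_fused = torch.cat([z_mel, z_mfcc, z_pitch, z_mlp], dim=1)
        return self.head(z_fused)

# Training step with categorical cross-entropy and Adam (as listed in Section 3.1):
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = criterion(model(z_mel, z_mfcc, z_pitch, z_mlp), raga_labels)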

Summary

Branch                 Learns from…                 Learns what…
CNN on Mel             Spectrogram                  Instrument sound, texture
CNN + GRU on MFCC      Cepstral + sequential info   Timbre + temporal change
GRU on Pitch           Pitch contour                Raga melody pattern, pakad
MLP on Stats           Global features              Swara use, harmony, rhythm, tonic
Fusion → Classifier    All of the above             Predicts the most likely raga

Why This Architecture Works for Raga Detection

Component          What It Learns                              Importance for Raga
CNN on Mel         Instrumental timbre & spectral patterns     Timbre-rich clues
CNN+GRU on MFCC    Envelope changes and sequential tone shifts Artist/instrument styles
GRU on Pitch       Melodic grammar & phrase curves             Pakad & Arohana
MLP on Stats       Taal, tonic, swara emphasis                 Global structure

Advantages of This Architecture

Feature                    Benefit
Multi-Branch Design        Specializes for different types of input representations
CNNs for Spectrograms      Extract localized spectral features robust to noise
GRUs for Melody            Capture raga flow and phrase transitions (Pakad)
MLP for Statistics         Simple and efficient modeling of flat distributions
Tonic-Normalized Modeling  Handles pitch shifts across artists and instruments

• RESPONSE GENERATION

The system generates a response based on the analyzed query and provides
the answer in text format. Optionally, the response can also be delivered in
spoken form using text-to-speech technology.

3.7 Deleting Current Chat Session and Clearing Cache

Our system includes features for deleting the current chat session and clearing the cache. These functionalities allow users to manage their interactions and ensure a fresh start whenever needed.

Functional overview
• DELETING CURRENT CHAT SESSION

Users can delete their current chat session through the user interface. This removes the ongoing conversation, allowing users to start anew without previous context.

Figure 3.21: Deleting current chat session and clearing cache

• CLEARING CACHE

Users have the option to clear the system's cache. This removes all temporary data, ensuring that the system starts fresh with no residual information from past interactions.

4 Conclusion
The results of this QA system using the NLP project were very promising and well produced, as we intended. This approach involves processing a large number of questions simultaneously, with a large amount of data given as input to the question answering system. It involves natural language processing performed based on linguistic analysis. The question is given in natural language through the input device, where the system treats it as a query and returns the best suitable statement or sentence as the output.

The question answering system produces only the requested information instead of searching the whole database or whole documents as a search engine does, which is generally a time-consuming process. In our daily life, information is increasing rapidly, so extracting even the required piece of information requires many resources.

Bibliography

[1] A. Allam and M. Haggag, The question answering systems: A survey, International Journal of Research and Reviews in Information Sciences, 2 (2012), pp. 211–221.

[2] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, Mistral 7B, 2023.

[3] H. Liu, C. Li, Y. Li, and Y. J. Lee, Improved baselines with visual instruction tuning, 2024.

[4] H. Liu, C. Li, Q. Wu, and Y. J. Lee, Visual instruction tuning, 2023.

[5] B. Ojokoh and E. Adebisi, A review of question answering systems, Journal of Web Engineering, 17 (2018), pp. 717–758.

[6] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, Robust speech recognition via large-scale weak supervision, 2022.

[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language models are unsupervised multitask learners, 2019.

[8] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023.
