QAS Final Report-2
Question Answering System
A DISSERTATION
Master of Science
in
Computer Science
by
Chaiti Paul
Student roll: 22332305
Deep Das
Student roll: 22332306
We, Chaiti Paul and Deep Das, declare that this work has not been
submitted for a degree elsewhere, in part or in full. Further, due credit has
been given to all referenced works.
SIGNATURE:
NO: 22342406
SIGNATURE:
CERTIFICATE
This is to certify that the work entitled Multisense GPT, a Question Answering System
by Chaiti Paul and Deep Das, has been carried out under my supervision. The
undersigned does not necessarily endorse any of the statements made or opinions
expressed therein.
ACKNOWLEDGMENTS
First, I express my gratitude to the Almighty, who blessed me, and to Barrackpore,
Kolkata for their motivation and tireless efforts to help me throughout the life cycle
of my M.Sc. dissertation work.
I also thank Prof. Kunal Das, Head of the Computer Science Department,
for her fruitful guidance through the early years of chaos and confusion. I
wish to thank the faculty members and supporting staff of the Computer
Science Department for their full support and heartiest cooperation.
This thesis would not have been possible without the hearty support of my family,
their affection, and continuous support. Last but not least, I thank GOD, the
Almighty, for giving me the inner willingness, strength, and wisdom to carry out
this work.
ABSTRACT
A Question Answering System (QAS) is an information retrieval approach that draws
on Information Extraction.
Table of Contents
1 Introduction
2 Literature Review
2.1 LLaVA
2.2 ggml-model-q5-k
2.3 mmproj-model-f16
2.4 Mistral-7b-instruct-v0.1.Q5-K-M
2.5 Whisper AI
2.6 GPT-2
2.7 Llama-2-13B-chat-GGML
3 Methodology
Working model
Features
3.1 Knowledge Base-Driven Responses Without Input
3.2 PDF Analysis and Question Answering
3.3 Photo Analysis and Question Answering
3.4 Video Analysis and Question Answering
3.5 Session Management
3.6 Voice Commands
3.7 Deleting Current Chat Session and Clearing Cache
4 Conclusion
Bibliography
Chapter 1
Introduction
The proliferation of internet usage in conjunction with the remarkable expansion
of data storage capacity has granted us the capability to both securely archive
and effectively disseminate data to the public. However, sifting through this vast
amount of data has made finding information time-consuming and complex. As a
result, there has been a push to develop new research tools, such as Question
Answering Systems, to address this challenge. Question answering systems
represent a noteworthy advancement in information retrieval technologies,
especially in their ability to access knowledge resources naturally by querying
and retrieving the right replies in succinct words.
Since the 1960s, numerous QA Systems (QAS) have been developed to respond
to user inquiries through various approaches. These systems have addressed a
wide range of domains, databases, question types, and answer structures. Modern
methods involve retrieving and analyzing data from multiple sources to effectively
respond to questions presented in natural language. A Question Answering (QA)[1]
system is a multidisciplinary research area that encompasses Information Retrieval
(IR), Information Extraction (IE), and Natural Language Processing (NLP). The
objective of a question-answering system is to provide answers to queries rather
than overwhelming users with full documents or the most pertinent passages,
which is the typical functionality of most information retrieval systems. For
example, for the question "Who was the first Prime Minister of India?", the exact
answer expected by the user is "Pandit Jawaharlal Nehru"; the user does not
intend to read through passages or documents that merely match words like
first, prime minister, India, etc.
The question-answering system[5] is continually advancing and improving to meet
the challenges and opportunities in this field through the integration of new trends
and innovations. From the initial text-based programs to the current AI-powered
question-answering systems, virtual assistants and chatbots have emerged as
sophisticated communication tools, offering automation and customer support
capabilities. Numerous businesses utilize these systems on their websites and
social media accounts to offer round-the-clock customer support without the
need for a large team of human agents. Advanced
chatbots are also employed in the healthcare, finance, and education
sectors to deliver personalized assistance and support. Additionally, some
chatbots can collect user data, which can be used to improve marketing
campaigns and personalize the customer experience. One of the renowned
chatbots in the current market, ChatGPT, is a sophisticated language model
designed to enhance the engagement and informativeness of chatbots. Trained
on an extensive dataset comprising text and code, ChatGPT possesses the
capacity to comprehend and produce human-like text. However, despite their
advanced capabilities, these systems are not without limitations. When used as
a QAS without any text for answer extraction, systems like ChatGPT must rely
on their own knowledge to generate answers, which can be strongly influenced
by training bias. Also, these systems cannot interact with audio/video data or PDFs.
We attempted to integrate these types of data into our system. Our system can
process audio, PDFs, and images to provide human-like responses to user queries.
Without context to generate responses from, it uses its knowledge base to provide
an answer. We have also incorporated a voice feature into our system to offer a
dependable and user-friendly environment. Furthermore, we store users' previous
conversation sessions along with timestamps in the system database, enabling
users to access and review them anytime.
Chapter 2
Literature Review
We selected the models based on their state-of-the-art performance in various NLP
tasks and their availability through trusted providers. The models were chosen to meet
our performance requirements, handle our data efficiently, and integrate seamlessly
into our system architecture. The models that we have utilized to implement our
system are LLaVA, ggml-model-q5-k, mmproj-model-f16, Mistral-7b-instruct-v0.1.Q5-K-M,
Whisper AI, GPT-2, and Llama-2-13B-chat-GGML. A few studies of these models follow:
2.1 LLaVA
LLaVA (Large Language and Vision Assistant)[4] is an open-source chatbot, an
end-to-end trained large multimodal model that connects a vision encoder and an LLM for
general-purpose visual and language understanding, achieving impressive chat
capabilities. It is designed to understand and generate content based on visual
inputs (images) and textual instructions.
2.2 ggml-model-q5-k
The ggml-model-q5-k is utilized to produce responses for queries related to PDFs. It is a
quantized (Q5_K) GGML model file from llava-v1.5-13b[3].
2.3 mmproj-model-f16
The mmproj-model-f16 is utilized to produce responses for queries related to images. It
is the f16 multimodal projector file from llava-v1.5-7b.
2.4 Mistral-7b-instruct-v0.1.Q5-K-M
The Mistral-7B-Instruct-v0.1[2] Large Language Model (LLM) is an instruct fine-tuned
version of the Mistral-7B-v0.1 generative text model, trained using a variety of publicly
available conversation datasets.
CHAPTER 2. LITERATURE REVIEW
2.5 Whisper AI
Whisper[6] is a general-purpose speech recognition model. It is trained on a large
dataset of diverse audio and is a multitasking model that can perform multilingual
speech recognition, speech translation, and language identification. Whisper is a
transformer-based encoder-decoder model. It was trained on 1 million hours of
weakly labeled audio and 4 million hours of pseudo-labeled audio. Whisper performs
speech transcription, where the source audio language is the same as the target
text language.
2.6 GPT-2
The GPT-2[7] model was introduced in the paper "Language Models are Unsupervised
Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
Amodei, and Ilya Sutskever from OpenAI. It is a causal (unidirectional) transformer
that was pretrained using language modeling on a large dataset of approximately
40 GB of text. GPT-2 is a large transformer-based language model with 1.5 billion
parameters, trained on a dataset consisting of 8 million web pages.
2.7 Llama-2-13B-chat-GGML
Llama 2[8] is a collection of pre-trained and fine-tuned generative text models
ranging in scale from 7 billion to 70 billion parameters, available in 7B, 13B, and
70B parameter sizes.
Llama 2-Chat is a family of fine-tuned Llama-2 models that are optimized for dialogue
use cases. These models are specifically designed to generate human-like responses
to natural language input, making them suitable for chatbot and conversational AI
applications.
Chapter 3
Methodology
3.1 System Overview
The proposed system for Indian Classical Instrumental Raga Detection is a
comprehensive, data-driven architecture that combines advanced audio signal
processing and deep learning techniques. Its design is inspired by the rich,
multidimensional nature of Indian classical music, which incorporates melody,
rhythm, pitch structures, and modal scales. The system aims to identify the raga
of a given instrumental audio clip with high accuracy by analyzing both low-level
and high-level musical features.
CHAPTER 3. METHODOLOGY
System Strengths
• Tonic-invariant: Uses tonic-normalized pitch features.
• Multi-dimensional: Considers melody, rhythm, and harmony simultaneously.
• Modular: Easily extendable for new features or musical styles.
• Instrument-agnostic: Designed specifically for instrumental music, which lacks the
syllabic markers present in vocals.
Data Organization
The audio files were organized in a folder hierarchy such as:
Data/
    raga bhairav/      (.wav files)
    raga bhoop/        (.wav files)
    raga madhuvanti/   (.wav files)
    raga shudh sarang/ (.wav files)
    raga yaman/        (.wav files)
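The folder hierarchy above can be scanned with a short helper that maps each raga name to its audio files. This is an illustrative sketch only; the function name `collect_dataset` and the flat `Data/<raga>/` layout are assumptions, not code from the project.

```python
import os

def collect_dataset(root: str) -> dict[str, list[str]]:
    """Map each raga folder name under `root` to the .wav files it contains."""
    dataset = {}
    for raga_dir in sorted(os.listdir(root)):
        full = os.path.join(root, raga_dir)
        if not os.path.isdir(full):
            continue  # skip stray files at the top level
        dataset[raga_dir] = [
            os.path.join(full, f)
            for f in sorted(os.listdir(full))
            if f.endswith(".wav")
        ]
    return dataset
```

The folder name doubles as the class label, so no separate annotation file is needed.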
1. Resampling
All audio signals were resampled to a standard sampling
rate of 22,050 Hz to match the input requirements of
librosa and reduce computational overhead. This rate is
sufficient for capturing essential harmonic and melodic
content up to 11 kHz, which includes all necessary frequency
information for Indian classical instruments.
2. Mono Conversion
Recordings with stereo channels were converted to mono by
averaging the two channels. This simplifies analysis and
removes any spatial or panning information which is
irrelevant for raga recognition.
3. Duration Normalization
To maintain uniform input dimensions:
• Audio clips were trimmed to a maximum duration of X seconds.
• Shorter clips were zero-padded to match the target duration.
This step ensures that extracted features (e.g., mel-spectrograms, pitch contours,
etc.) have consistent temporal dimensions across all samples.
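The mono conversion and duration normalization steps above can be sketched in NumPy as follows (resampling itself would typically be delegated to librosa's loader). The 30-second target length stands in for the unspecified "X seconds" and is an assumption.

```python
import numpy as np

TARGET_SR = 22050        # standard sampling rate from the preprocessing step
TARGET_SECONDS = 30      # hypothetical clip length ("X seconds" in the text)

def preprocess(signal: np.ndarray) -> np.ndarray:
    """Mono-average a (channels, samples) array and pad/trim to a fixed length."""
    if signal.ndim == 2:                  # stereo -> mono by channel averaging
        signal = signal.mean(axis=0)
    target_len = TARGET_SR * TARGET_SECONDS
    if len(signal) > target_len:          # trim clips longer than the target
        signal = signal[:target_len]
    else:                                 # zero-pad shorter clips
        signal = np.pad(signal, (0, target_len - len(signal)))
    return signal
```

Every clip then has exactly `TARGET_SR * TARGET_SECONDS` samples, so downstream features share one temporal shape.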
Train–Validation–Test Split:
Use multiple splits for better distribution and accuracy:
Example 1:
o Training set: 70%
o Validation set: 15%
o Test set: 15%
Example 2:
o Training set: 80%
o Validation set: 10%
o Test set: 10%
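A split like Example 1 can be produced by shuffling sample indices and cutting them into three partitions. This is a minimal sketch under the assumption of a simple random (non-stratified) split; the function name `split_indices` is hypothetical.

```python
import numpy as np

def split_indices(n: int, val_frac: float = 0.15, test_frac: float = 0.15,
                  seed: int = 0):
    """Shuffle sample indices and cut them into train/val/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)              # random order, reproducible via seed
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]          # remainder (70% here)
    return train, val, test
```

Passing `val_frac=0.10, test_frac=0.10` yields the 80/10/10 split of Example 2.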
Note:
The preprocessing and data preparation phase ensured that the
raw audio recordings were converted into a clean, consistent,
and augmented format suitable for deep learning. Careful
attention was given to maintain musical integrity, especially in
preserving the tonic reference and melodic content that are
crucial for successful raga detection.
1.1.2 Mel-Spectrogram
Librosa Function: librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N)
Description: A perceptually scaled spectrogram emphasizing human auditory
resolution.
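The perceptual scaling behind the mel-spectrogram comes from the mel frequency mapping. A common form is the HTK formula below (librosa's default is the slightly different Slaney variant, selectable via its `htk` parameter); this standalone sketch shows just the frequency warping, not the full spectrogram computation.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# The scale is roughly linear below ~1 kHz and logarithmic above,
# so low frequencies get finer resolution, mirroring human hearing.
```

With `n_mels=N`, librosa places N triangular filters evenly on this warped axis and sums spectrogram energy within each band.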
Purpose:
• Prevents features with large values (e.g., energy-based) from dominating others.
• Accelerates training convergence in neural networks.
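The normalization described above is typically a per-feature z-score. A minimal sketch, assuming features are stacked as rows (samples) by columns (features); constant columns are guarded against division by zero.

```python
import numpy as np

def zscore(features: np.ndarray) -> np.ndarray:
    """Standardize each feature column to zero mean and unit variance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    # Avoid division by zero for constant-valued features.
    return (features - mu) / np.where(sigma == 0, 1.0, sigma)
```

In practice the mean and standard deviation are computed on the training set only and reused on validation/test data to avoid leakage.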
Feature Type         | Feature Name          | Model Type Used       | Why?
Time-Frequency Map   | Mel-Spectrogram       | Multilayer CNN        | Learns spatial (time-frequency) patterns
Time-Series Sequence | MFCC (+Delta)         | CNN + GRU             | Captures temporal evolution of timbre
Time-Series Sequence | Pitch Contour (Cents) | Multilayer GRU        | Captures sequential melodic flow
Vector (Statistical) | Pitch Histogram       | MLP (Fully Connected) | Encodes note usage distribution
Vector (Statistical) | Chroma, Tonnetz       | MLP                   | Encodes harmonic structure
Vector (Statistical) | Rhythm (Tempogram)    | MLP                   | Represents beat and rhythmic cycle
Vector (Categorical) | Vadi–Samvadi + Tonic  | MLP                   | Categorical and scalar musical metadata
Layer-wise Operations:
(1) 2D Convolution Layer
Applies a set of filters (small windows) that slide over the mel-spectrogram and
extract patterns.
Equation:

h^{(1)}_{ij} = \sigma\left( \sum_{m=0}^{k_m - 1} \sum_{t=0}^{k_t - 1} W_{mt} \cdot X_{i+m,\, j+t} + b \right)

Where:
• X: input mel-spectrogram
• W: learned filters
• b: bias
• \sigma: activation function (ReLU)

The convolution output is then downsampled by max pooling:

h^{(2)}_{ij} = \max_{(m,t) \in \text{pool}} h^{(1)}_{i+m,\, j+t}
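A minimal NumPy sketch of the convolution and max-pooling operations above; this assumes "valid" padding, a single filter, and non-overlapping pooling windows, which real frameworks generalize.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv2d(X, W, b):
    """Valid 2D convolution: h[i,j] = relu(sum_{m,t} W[m,t] * X[i+m, j+t] + b)."""
    km, kt = W.shape
    H = X.shape[0] - km + 1
    T = X.shape[1] - kt + 1
    out = np.empty((H, T))
    for i in range(H):
        for j in range(T):
            out[i, j] = np.sum(W * X[i:i + km, j:j + kt]) + b
    return relu(out)

def maxpool2d(h, p=2):
    """Non-overlapping p x p max pooling (trailing rows/cols are dropped)."""
    H, T = h.shape[0] // p, h.shape[1] // p
    return h[:H * p, :T * p].reshape(H, p, T, p).max(axis=(1, 3))
```

Stacking several such conv + pool stages lets the CNN branch summarize progressively larger time-frequency neighborhoods of the mel-spectrogram.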
For sequence inputs such as MFCC frames, a 1D convolution is applied along time,
summing over the feature dimension f:

h^{(1)}_{t} = \sigma\left( \sum_{f=0}^{k - 1} W_{f} \cdot X_{f,\, t} + b \right)
(2) GRU – Gated Recurrent Unit Layer (RNN)
GRU is a type of RNN that processes sequences and remembers important
information. It models sequential dependency across time.
GRU Update Equations:
• Update gate (how much to keep from the past):
z_t = \sigma( W_z x_t + U_z h_{t-1} )
• Reset gate:
r_t = \sigma( W_r x_t + U_r h_{t-1} )
• Candidate state:
\tilde{h}_t = \tanh( W_h x_t + U_h ( r_t \odot h_{t-1} ) )
• Final state:
h_t = ( 1 - z_t ) \odot h_{t-1} + z_t \odot \tilde{h}_t
Where:
• x_t: input at time t
• h_t: hidden state
• z_t: update gate
• r_t: reset gate
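One step of the GRU update equations above can be written directly in NumPy. This sketch omits the bias terms (as the equations do) and is not a trained layer, just the arithmetic of a single time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update following the gate equations above (biases omitted)."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde           # final state
```

Iterating `gru_step` over a pitch-contour or MFCC sequence produces the hidden states whose final value summarizes the melodic flow.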
Layers:
(1) Dense Layer:

h^{(1)} = \sigma( W_1 x + b_1 )

Output:
• A learned vector that summarizes the structural musical features.
(2) Dropout
(3) Output Layer:

\hat{y} = \mathrm{softmax}( W_{out} \cdot h + b_{out} )

Where:
• \hat{y}: the predicted probability for each raga class
The network is trained with the categorical cross-entropy loss:

L = - \sum_{i=1}^{C} y_i \log \hat{y}_i

Where:
• C: number of raga classes
• y_i: ground truth (one-hot)
• \hat{y}_i: predicted probability
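The softmax output and the cross-entropy loss over the C raga classes can be sketched as below; the max-subtraction in the softmax and the `eps` guard in the loss are standard numerical-stability additions, not part of the equations themselves.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the raga classes."""
    z = logits - logits.max()     # shift for stability; does not change result
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i) for a one-hot ground truth vector."""
    return -np.sum(y_true * np.log(y_pred + eps))
```

For a one-hot target, the loss reduces to the negative log-probability assigned to the correct raga, so a confident correct prediction drives it toward zero.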
Summary
Branch              | Learns from…               | Learns what…
CNN on Mel          | Spectrogram                | Instrument sound, texture
CNN + GRU on MFCC   | Cepstral + sequential info | Timbre + temporal change
GRU on Pitch        | Pitch contour              | Raga melody pattern, pakad
MLP on Stats        | Global features            | Swara use, harmony, rhythm, tonic
Fusion → Classifier | All of the above           | Predicts the most likely raga
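The fusion row of the table can be sketched as concatenating each branch's embedding and applying the softmax output layer. This is a sketch under the assumption that fusion is simple concatenation; the function name and weight shapes are illustrative, not the project's actual layer.

```python
import numpy as np

def fuse_and_classify(branch_embeddings, W_out, b_out):
    """Concatenate per-branch embeddings and apply a softmax output layer."""
    h = np.concatenate(branch_embeddings)     # fused representation
    logits = W_out @ h + b_out
    e = np.exp(logits - logits.max())         # stable softmax
    return e / e.sum()                        # probability per raga class
```

The predicted raga is then `np.argmax` over the returned probability vector.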
• RESPONSE GENERATION
The system generates a response based on the analyzed query and provides
the answer in text format. Optionally, the response can also be delivered in
spoken form using text-to-speech technology.
Functional overview
• DELETING CURRENT CHAT SESSION
Users can delete their current chat session through the user interface. This
removes the ongoing conversation, allowing users to start anew without
previous context.
• CLEARING CACHE
Users have the option to clear the system’s cache. This removes all
temporary data, ensuring that the system starts fresh with no residual
information from past interactions.
Chapter 4
Conclusion
The results of this QA system were very promising and matched our intentions.
The approach involves processing a large number of questions simultaneously,
with a large amount of data given as input to the question answering system. It
relies on natural language processing performed through linguistic analysis: the
question is given in natural language through the input device, the system treats
it as a query, and it returns the most suitable statement or sentence as the output.
The question answering system produces only the requested information instead
of searching the whole database or all documents as a search engine does, which
is generally a time-consuming process. In our daily life, information is increasing
rapidly, so extracting even a required piece of information can demand many
resources.
Bibliography
[3] H. LIU, C. LI, Y. LI, AND Y. J. LEE, Improved baselines with visual instruction
tuning, 2024.
[4] H. LIU, C. LI, Q. WU, AND Y. J. LEE, Visual instruction tuning, 2023.
[5] B. OJOKOH AND E. ADEBISI, A review of question answering systems, Journal of
Web Engineering, 17 (2018), pp. 717–758.