
https://doi.org/10.22214/ijraset.2024.58670
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 12 Issue II Feb 2024- Available at www.ijraset.com

PDF-Driven Q&A: A Review

Prof. Sneha R. Sontakke¹, Shreyash V. Gondane², Anup S. Nandedkar³, Sahil T. Chauhan⁴, Nama Z. Choudhari⁵
¹Assistant Professor at P R Pote College of Engineering and Management, Amravati
²,³,⁴,⁵Students at P R Pote College of Engineering and Management, Amravati

Abstract: Traditional methods of PDF retrieval often suffer from inefficiencies and inaccuracies because they rely on keyword-based search algorithms. These methods do not capture the meaning of words and struggle with synonymy (different words with the same meaning), polysemy (one word with several meanings), and context, which degrades the quality of the search results. This project proposes an approach to address these shortcomings by developing a comprehensive system for efficient PDF document management and context-aware question answering. The system integrates several components: a user-friendly interface for PDF upload, LangChain techniques for breaking down lengthy PDFs into manageable chunks, and vectorization using OpenAI's language models. The vectorized chunks are stored in the Faiss Vector Database for rapid retrieval. User queries are converted into vectors using the same language model, and a semantic search algorithm matches them against the stored PDF chunks to retrieve contextually relevant answers. By presenting these answers in a user-friendly format, the system aims to enhance accessibility and usability, enabling seamless access to pertinent information within PDF documents.
Keywords: Chatbot, PDF Information Retrieval, LangChain, Faiss Vector Database, Chunks.

I. INTRODUCTION
In an era marked by the exponential growth of digital information, effective management and retrieval of knowledge from vast
repositories such as PDF documents pose significant challenges. To address this need, we propose a methodology for implementing
a comprehensive system that enables users to upload PDF files, automatically breaks them down into manageable "chunks," and
facilitates context-aware question answering. This methodology encompasses two key components: PDF Upload and Chunking, and
Context-Aware Question Answering. The former focuses on providing users with a user-friendly interface for uploading PDF files
and employs advanced techniques to segment lengthy documents into smaller, more digestible units. Leveraging state-of-the-art
technologies such as OpenAI's language models and the Faiss Vector Database, this component ensures efficient storage and
retrieval of the vectorized PDF chunks. The latter component, Context-Aware Question Answering, addresses the challenge of
retrieving relevant information from the stored PDF chunks in response to user queries. By converting queries into vectors using the
same model employed for PDF chunk vectorization, the system maintains consistency in representation. A semantic search
algorithm is then utilized to match these vectors against the stored chunks, considering semantic similarity to deliver contextually
relevant answers to the users. Through this methodology, we aim to provide a structured approach for enhancing information
accessibility and user experience, empowering individuals to efficiently navigate and extract insights from large repositories of PDF
documents.
II. LITERATURE SURVEY

TABLE I
LITERATURE SURVEY.

| Sr. No. | Author | Title | Feature | Year |
|---|---|---|---|---|
| 1. | Sakib Shahriar and Kadhim Hayawi | Chatting With ChatGPT | ChatGPT can integrate with knowledge bases to provide relevant information based on large collections of text data, such as books, articles and web pages. | June 2023 |
| 2. | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | BERT: Pre-training of Deep Bidirectional Transformers | Natural Language Processing, Deep Learning, NLU (Natural Language Understanding). | June 2019 |
| 3. | Xingxing Zhang, Furu Wei and Ming Zhou | HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization | A method to pre-train document level hierarchical bidirectional transformer encoders on unlabeled data. | July 2019 |

©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 1553

In Table I above (the literature survey), "Chatting With ChatGPT" by Sakib Shahriar and Kadhim Hayawi (2023) explores the integration of ChatGPT, a chatbot technology, with knowledge bases to provide contextually relevant information. This approach uses natural language processing (NLP) techniques to improve conversational experiences. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova introduces deep bidirectional transformer models for NLP. The method advances language understanding tasks by pre-training on large-scale corpora. "HIBERT: Document Level Pre-Training of Hierarchical Bidirectional Transformers for Document Summarization" by Xingxing Zhang, Furu Wei, and Ming Zhou presents a methodology for document summarization that pre-trains hierarchical bidirectional transformers on unlabeled data.

III. METHODOLOGY
A. Methodology for Implementing PDF Upload and Chunking
1) User Interface for PDF Upload: Develop a user-friendly interface that allows users to select and upload their PDF documents. The interface should be intuitive and accessible.
2) LangChain Techniques for Chunking: Extract the text of each PDF with a library such as PyPDF2 or PDFMiner, then use LangChain's text-splitting utilities to partition lengthy documents into smaller, manageable "chunks." The splitting should follow logical divisions within the PDF, such as paragraphs or sections, to create more digestible segments for processing.
3) Vectorization with OpenAI: Use OpenAI's language models, such as the embedding models of the GPT-3 family, to convert each chunk of PDF content into a high-dimensional vector. The vector is a numerical representation of the chunk's text that captures semantic meaning and context, which is what makes similarity-based retrieval possible.
4) Faiss Vector Database Integration: Integrate Faiss to store the vectorized PDF chunks. Faiss is a similarity-search library known for its scalability and fast retrieval, which makes it well suited to managing large volumes of high-dimensional vectors.
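
The chunking step can be illustrated with a minimal sketch. It is not LangChain's implementation: `chunk_text`, `chunk_size`, and `overlap` are hypothetical names chosen for illustration, and the split is purely character-based, whereas a real text splitter would also try to respect paragraph and sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so content cut at a
    chunk boundary still appears in full context in the neighbouring chunk.
    (Illustrative stand-in for a library text splitter.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text, so no trailing
        # chunk consisting only of already-covered overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


# Example: a 50-character "document" split into overlapping 20-character chunks.
sample = "abcdefghij" * 5
pieces = chunk_text(sample, chunk_size=20, overlap=5)
```

The overlap is a design choice rather than a requirement: it trades some duplicated storage for a lower chance that an answer is split across two chunks.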

Figure 1.1: Both the text and its vector embedding must be stored in the database, with the vector serving as the key. An LLM is required to convert each text chunk into a vector, and the same LLM should be used for querying.

Figures 1.1 and 1.2 are taken from Anshumali Shrivastava (June 2023), "Understanding the Fundamental Limitations of Vector-Based Retrieval for Building LLM-Powered Chatbots (Part 1/3)," ThirdAI Blog. Retrieved from [Medium].
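
The storage layout described in Figure 1.1 can be sketched as follows. Everything here is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for an OpenAI embedding call, `ChunkStore` is a plain in-memory structure rather than Faiss, and `DIM` is far smaller than a real embedding dimension. The point of the sketch is only that each chunk's text is kept alongside its vector and that one embedding function is fixed for the whole store.

```python
import hashlib
import math

DIM = 64  # illustrative dimension; real embedding models use many more


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")
        if not word:
            continue
        # Map each word to a fixed bucket so the same text always
        # produces the same vector.
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class ChunkStore:
    """Keeps each chunk's text next to its vector, addressed by position."""

    def __init__(self, embed):
        self.embed = embed      # the one embedding function used everywhere
        self.texts = []         # texts[i] is the source of vectors[i]
        self.vectors = []

    def add(self, chunk: str) -> int:
        """Embed and store a chunk; return its position in the store."""
        self.texts.append(chunk)
        self.vectors.append(self.embed(chunk))
        return len(self.texts) - 1


store = ChunkStore(toy_embed)
for chunk in [
    "Faiss stores high-dimensional vectors.",
    "Users upload PDF files through a web form.",
]:
    store.add(chunk)
```

Because the store holds a reference to its embedding function, a query can be embedded with `store.embed(query)`, which keeps query vectors and chunk vectors in the same space.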


Figure 1.2: The Q&A Phase

B. Methodology for Context-Aware Question Answering

1) User Query Interface: Develop an intuitive user interface that enables users to input questions or queries into the system. This
interface should provide a seamless and user-friendly experience, allowing users to easily interact with the system and submit
their queries.
2) Conversion of Queries into Vectors: Utilize the same OpenAI model employed for PDF chunk vectorization to convert user
queries into vectors. By using consistent vector representations across queries and PDF chunks, the system ensures that queries
are appropriately matched with relevant document segments during the search process.
3) Semantic Search Algorithm: Implement a semantic search algorithm capable of performing vector matching against the Faiss-
stored PDF chunks. This algorithm should consider the semantic similarity between the query vector and the vectorized PDF
chunks, enabling the system to retrieve contextually relevant answers that accurately address the user's query.
4) Answer Retrieval and Presentation: Retrieve the contextually relevant answers from the Faiss Vector Database and present
them to the user in a clear and user-friendly format. This presentation format could include ranked lists of answers or
highlighted excerpts from the original PDF documents, ensuring that users can easily review and comprehend the information
provided by the system.
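
The four steps above can be sketched end to end. This is a minimal sketch under stated assumptions, not the system's implementation: `embed` is a vocabulary-based bag-of-words stand-in for the OpenAI model, the brute-force loop in `search` stands in for a Faiss index lookup, and all function names are hypothetical. Semantic similarity is taken to be cosine similarity, $\cos(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}$, which is one common choice.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(chunks: list[str]) -> dict[str, int]:
    """Assign one vector position to every distinct token in the chunks."""
    vocab: dict[str, int] = {}
    for chunk in chunks:
        for tok in tokenize(chunk):
            vocab.setdefault(tok, len(vocab))
    return vocab


def embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Stand-in embedding: token counts over the shared vocabulary."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        idx = vocab.get(tok)
        if idx is not None:  # tokens unseen in the chunks are ignored
            vec[idx] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query, chunks, chunk_vectors, vocab, k=2):
    """Return the k chunks most similar to the query as (score, text) pairs."""
    q = embed(query, vocab)  # same embedding function as the chunks
    scored = [(cosine(q, v), c) for v, c in zip(chunk_vectors, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "Faiss stores vectors and supports fast similarity search.",
    "Users upload PDF files through a web interface.",
    "Long documents are split into smaller chunks before embedding.",
]
vocab = build_vocab(chunks)
chunk_vectors = [embed(c, vocab) for c in chunks]  # computed once, at upload time

results = search("How does Faiss search vectors?", chunks, chunk_vectors, vocab)
for score, text in results:
    print(f"{score:.3f}  {text}")  # ranked list, best match first
```

In a Faiss-backed version, the loop in `search` would be replaced by an index query; an inner-product index over L2-normalised vectors yields the same cosine ranking.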

IV. LIMITATIONS
1) Technical Limitations: AI-powered chatbots may encounter technical limitations such as processing speed, memory constraints, or compatibility issues with certain document formats or languages.
2) Handling of Non-Textual Content: PDFs can include non-textual content such as images, graphs, and charts, which may contain valuable information relevant to the user's query. However, the chatbot may lack the capability to analyze and interpret such content, so its answers are limited to the textual information extracted from the PDF.
3) Integration Complexity: Integrating AI document readers into existing workflows and systems can be complex and time-consuming. Compatibility issues, data synchronization, and customization requirements may arise during the integration process, requiring specialized technical expertise and resources.

V. FUTURE SCOPE
The future scope of the chatbot could be to enhance its ability to extract and understand information from PDF documents provided
by users. This could involve implementing advanced natural language processing techniques to accurately interpret the content,
allowing the chatbot to provide more insightful and tailored responses based on the information within the PDFs. Additionally,
incorporating machine learning algorithms could help the chatbot to learn and improve its ability to extract suitable information over
time, making it even more effective at assisting users with their queries.


VI. CONCLUSION
PDF-driven question-and-answer (Q&A) retrieval systems play a pivotal role in modern information management by efficiently
extracting and categorizing data from documents. Leveraging advanced algorithms, these systems automate the processing of PDF
files, identifying pertinent information and structuring it for easy access and analysis. However, challenges such as initial
investment requirements, data dependency, and technical limitations pose obstacles to their implementation. Despite these
challenges, PDF-driven Q&A retrieval systems offer invaluable solutions, streamlining information retrieval processes and
enhancing efficiency within organizations. By automating tasks, reducing manual intervention, and facilitating rapid access to
relevant data, these systems contribute to heightened productivity, informed decision-making, and optimized resource utilization.
Continuous advancements in artificial intelligence and natural language processing further augment their capabilities, reinforcing
their status as indispensable tools for modern enterprises.

REFERENCES
[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). DOI: 10.18653/v1/N19-1423.
[2] Shahriar, S., & Hayawi, K. (2023). Let’s Have a Chat! A Conversation with ChatGPT: Technology, Applications, and Limitations. Artificial Intelligence and Applications. DOI: 10.47852/bonviewAIA3202939.
[3] Zhang, X., Wei, F., & Zhou, M. (2019). HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 5059–5069). Florence, Italy: Association for Computational Linguistics.
[4] Yadav, D., Desai, J., & Yadav, A. K. (2022). Automatic Text Summarization Methods: A Comprehensive Review. arXiv. Retrieved from https://doi.org/10.48550/arXiv.2204.0184
[5] Lin, J., Pradeep, R., Teofili, T., & Xian, J. (2023, August 29). Vector Search with OpenAI Embeddings: Lucene Is All You Need. arXiv:2308.14963v1 [cs.IR].
[6] Pandiarajan, S., Yazhmozhi, V. M., & Praveen Kumar, P. (2015). Semantic Search Engine Using Natural Language Processing. In Proceedings of the International Conference on Advanced Computing and Communication Systems (pp. 641-649). DOI: 10.1007/978-3-319-07674-4_53.
[7] Shaikh, A., More, D., Puttoo, R., Shrivastav, S., & Shinde, S. (2019). A Survey Paper on Chatbots. International Research Journal of Engineering and Technology (IRJET), 06(04).
[8] Shrivastava, A. (June 2023). Understanding the Fundamental Limitations of Vector-Based Retrieval for Building LLM-Powered Chatbots (Part 1/3). ThirdAI Blog. Retrieved from [Medium].
[9] Hingu, D., Shah, D., & Udmale, S. S. (2015). Automatic Text Summarization of Wikipedia Articles. In 2015 International Conference on Communication, Information & Computing Technology (ICCICT), Mumbai, India, Jan. 16-17.
[10] Yang, X., Yang, K., Cui, T., Chen, M., & He, L. A Study of Text Vectorization Method Combining Topic Model and Transfer Learning. ISJ Theoretical & Applied Science, vol. 1, no. 2, pp. XX-XX, Year.
[11] Singh, K., & Shashi, M. (2019). Vectorization of Text Documents for Identifying Unifiable News Articles. International Journal of Advanced Computer Science and Applications (IJACSA), 10(7).
