This project implements a search engine for scientific research papers using Apache Lucene.
The system allows users to perform advanced searches with customizable queries, synonym expansion, wildcard search, sorting options, and search history management.
The engine is designed to retrieve the most relevant research papers based on user-defined queries and provides multiple ways of presenting and storing results.
- Source: Kaggle - NIPS Papers 1987–2019
- Preprocessing:
  - Random sample of 500 papers selected
  - Removed empty fields (e.g., `abstract`)
  - Cleaned line breaks in `full_context`
  - Final dataset stored in `papers_cleaned.csv`
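
The cleaning steps above can be sketched in plain Java. This is a minimal illustration, not the project's actual preprocessing code: the class name `PaperCleanerSketch` and both method names are hypothetical, and it assumes that "removed empty fields" means dropping any record that has a blank field.

```java
import java.util.ArrayList;
import java.util.List;

final class PaperCleanerSketch {

    /** Collapses every run of line breaks (plus surrounding whitespace) into one space. */
    public static String cleanLineBreaks(String text) {
        return text.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
    }

    /** Keeps only the rows in which every field is present and non-blank (assumed rule). */
    public static List<String[]> dropRowsWithEmptyFields(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            boolean complete = true;
            for (String field : row) {
                if (field == null || field.trim().isEmpty()) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Collapsing line breaks before writing the CSV keeps each paper on a single row, which makes the file easier to parse line by line.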
- Main → Handles the user interface and search flow
- CSVReader → Loads and parses the dataset
- SearchHistory → Stores and retrieves search history; provides query suggestions
- SearchResultsWriter → Saves search results to `.txt`
- SearchResultsWriterHTML → Saves search results to `.html` with highlighted terms
- LuceneSearch → Builds the index, performs searches, ranking, highlighting, and result pagination
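
As an illustration of how a history component could offer query suggestions, here is a minimal sketch. It is not the project's `SearchHistory` class: the name `SearchHistorySketch`, the in-memory storage, and the prefix-matching rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SearchHistorySketch {
    // Insertion-ordered and duplicate-free: the last element is the most recent query.
    private final Set<String> queries = new LinkedHashSet<>();

    /** Stores a query; repeating an old query moves it to the most recent position. */
    public void record(String query) {
        String normalized = query.trim().toLowerCase(Locale.ROOT);
        if (!normalized.isEmpty()) {
            queries.remove(normalized);
            queries.add(normalized);
        }
    }

    /** Returns earlier queries that start with the given prefix, most recent first. */
    public List<String> suggest(String prefix) {
        String wanted = prefix.trim().toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String query : queries) {
            if (query.startsWith(wanted)) {
                matches.add(0, query); // prepend so newer queries come first
            }
        }
        return matches;
    }
}
```

A real implementation would likely persist the history to disk between runs; the sketch keeps it in memory to stay short.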
- Field-based search (title, year, full text, etc.)
- Synonym expansion and improved query suggestion
- Wildcard search (`*`, `?`) support
- Result highlighting in console and HTML
- Sorting results by publication year (ascending/descending)
- Search history with query reuse suggestions
- Results presentation:
  - Console (paginated, 10 per page)
  - Text file (`.txt`)
  - HTML file (`.html`)
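
The year sorting and the 10-per-page console output can be sketched independently of Lucene. The example below is an assumption-laden illustration: `ResultPagerSketch`, its nested `Paper` type, and the method names are hypothetical, and the project may instead sort inside Lucene at search time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ResultPagerSketch {
    static final int PAGE_SIZE = 10; // matches the "10 per page" console setting

    /** Minimal stand-in for a search hit (hypothetical type). */
    static final class Paper {
        final String title;
        final int year;

        Paper(String title, int year) {
            this.title = title;
            this.year = year;
        }
    }

    /** Returns a sorted copy of the hits, ordered by publication year. */
    public static List<Paper> sortByYear(List<Paper> hits, boolean ascending) {
        Comparator<Paper> byYear = Comparator.comparingInt(p -> p.year);
        List<Paper> sorted = new ArrayList<>(hits);
        sorted.sort(ascending ? byYear : byYear.reversed());
        return sorted;
    }

    /** Returns the hits on the given 1-based page; pages past the end are empty. */
    public static List<Paper> page(List<Paper> hits, int pageNumber) {
        int from = Math.min((Math.max(pageNumber, 1) - 1) * PAGE_SIZE, hits.size());
        int to = Math.min(from + PAGE_SIZE, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }
}
```

Sorting a copy leaves the relevance-ranked list untouched, so the user can switch between relevance order and year order without searching again.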
Enter the field you want to search: title
Enter the query: nlp
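
One way the two answers above could be combined is as a string in Lucene's classic query syntax, with synonyms added as `OR` alternatives. The sketch below only builds that string; the class `QueryStringSketch` and its one-entry synonym table are hypothetical, special characters are not escaped, and the project may expand synonyms through an analyzer instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class QueryStringSketch {
    // Hypothetical synonym table; the project's real synonym source is not shown here.
    private static final Map<String, List<String>> SYNONYMS =
            Map.of("nlp", List.of("natural language processing"));

    /** Builds a field-scoped query string, adding known synonyms as OR alternatives. */
    public static String build(String field, String query) {
        String term = query.trim().toLowerCase(Locale.ROOT);
        List<String> alternatives = new ArrayList<>();
        alternatives.add(quoteIfPhrase(term));
        for (String synonym : SYNONYMS.getOrDefault(term, List.of())) {
            alternatives.add(quoteIfPhrase(synonym));
        }
        if (alternatives.size() == 1) {
            return field + ":" + alternatives.get(0);
        }
        return field + ":(" + String.join(" OR ", alternatives) + ")";
    }

    /** Wraps multi-word text in double quotes so it is treated as a phrase. */
    private static String quoteIfPhrase(String text) {
        return text.contains(" ") ? "\"" + text + "\"" : text;
    }
}
```

With the inputs shown above, `QueryStringSketch.build("title", "nlp")` returns `title:(nlp OR "natural language processing")`, while a wildcard query such as `neur*` passes through unchanged as `title:neur*`.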