name | seas login | division of labor
--------------- | ----------------- | ---------------
Yifan Li | lyifan | PageRank
Tianxiang Dong | dtianx | Crawler
Hanyu Yang | hany | Indexer
Yunwen Deng | dyunwen | Search Engine and User Interface
We built a distributed and scalable search engine backed by a distributed data storage system.
- The crawler follows the Mercator design and can run distributed across multiple machines, each with multiple crawling threads (see the crawler sketch after this list).
- The indexer builds unigram, bigram, and trigram indexes over the crawled documents, running a MapReduce job on Apache Hadoop to construct the lexicon store and the inverted-index store. The resulting data is partitioned across 5 machines so as to balance load and avoid data skew (see the indexer sketch below).
- The PageRank engine builds a revised web link graph and then runs an iterative MapReduce job on Apache Spark; we treat 15 iterations as the convergence point (see the PageRank sketch below).
- All data, including the inverted index, the lexicon, and the PageRank scores, is partitioned and stored in local Berkeley DB instances distributed across the 5 machines (see the storage sketch below). All original documents are stored in Amazon S3.
- When a query request arrives, the search engine retrieves data directly from the storage worker servers. Internally, it runs a MapReduce job with Apache Spark to rank and sort the matching documents, and the search engine then renders the result pages back to the user (see the ranking sketch below).
- The PageRank MapReduce job is implemented with Apache Spark (the indexing job, by contrast, runs on Apache Hadoop).
- We implemented fault tolerance against storage machine failure: if one or two storage machines are down at search time, the search engine keeps working and still returns reliable results (see the failover sketch below).
- We added an autocompletion feature to the search engine user interface (see the autocomplete sketch below).
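
The crawler bullet above refers to the Mercator pattern: a shared URL frontier drained by a pool of worker threads, with newly discovered links fed back into the frontier. Below is a minimal sketch of that threading structure; the class and method names (`CrawlerSketch`, `fetchAndParse`) are illustrative and do not come from the code in /src.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a Mercator-style frontier drained by N worker threads.
public class CrawlerSketch {
    // The frontier: a thread-safe queue of URLs waiting to be fetched.
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final ExecutorService pool;

    public CrawlerSketch(int numThreads, Iterable<String> seeds) {
        seeds.forEach(frontier::add);
        pool = Executors.newFixedThreadPool(numThreads);
        for (int i = 0; i < numThreads; i++) {
            pool.submit(this::crawlLoop);
        }
    }

    private void crawlLoop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String url = frontier.take();            // block until a URL is available
                for (String link : fetchAndParse(url)) { // fetch, parse, extract out-links
                    frontier.offer(link);                // feed new URLs back to the frontier
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Placeholder: a real crawler honors robots.txt and per-host delays, filters
    // content types, and routes each URL to the crawler node that owns its host.
    private List<String> fetchAndParse(String url) {
        return Collections.emptyList();
    }
}
```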
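For the indexer, the Hadoop job can be pictured as a mapper that emits (n-gram, document id) pairs, with a reducer (not shown) concatenating the ids into posting lists. The mapper below is a sketch under the assumption that each input record is a tab-separated `docId<TAB>text` line; the actual key/value formats in /src may differ.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each input value is "docId<TAB>document text";
// the output key is an n-gram (n = 1..3) and the value is the docId.
public class NGramMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) {
            return; // malformed record, skip
        }
        Text docId = new Text(parts[0]);
        String[] tokens = parts[1].toLowerCase().split("\\W+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].isEmpty()) {
                continue;
            }
            // Emit the unigram, bigram, and trigram starting at position i.
            for (int n = 1; n <= 3 && i + n <= tokens.length; n++) {
                String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                context.write(new Text(ngram), docId);
            }
        }
    }
}
```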
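The PageRank job follows the standard iterative scheme: each page spreads its current rank evenly over its out-links, and the summed contributions are damped before the next round. The Spark driver below sketches that loop with our 15-iteration cutoff, assuming the link graph is available as a pair RDD of (page, out-link list); the damping factor of 0.85 and the input format are assumptions, not a quote of our job.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class PageRankSketch {
    // Run 15 damped PageRank iterations over a (page -> out-links) graph.
    // Dangling pages (empty out-link lists) are ignored here for brevity.
    public static JavaPairRDD<String, Double> run(JavaPairRDD<String, List<String>> links) {
        JavaPairRDD<String, Double> ranks = links.mapValues(v -> 1.0);
        for (int iter = 0; iter < 15; iter++) {
            JavaPairRDD<String, Double> contribs = links.join(ranks).values()
                .flatMapToPair(pair -> {
                    List<String> outLinks = pair._1();
                    double share = pair._2() / outLinks.size();
                    List<Tuple2<String, Double>> out = new ArrayList<>();
                    for (String dest : outLinks) {
                        out.add(new Tuple2<>(dest, share)); // spread rank over out-links
                    }
                    return out.iterator();
                });
            // Damping factor 0.85 is assumed; 15 iterations is the convergence cutoff.
            ranks = contribs.reduceByKey(Double::sum)
                            .mapValues(sum -> 0.15 + 0.85 * sum);
        }
        return ranks;
    }
}
```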
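Partitioning the data amounts to hashing a key (a term for the lexicon and inverted index, a page identifier for PageRank scores) to one of the 5 storage workers, each of which keeps its shard in a local Berkeley DB JE store. The helper below illustrates that idea; the database name and the hash-based placement are illustrative assumptions rather than the exact scheme in /src.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class ShardedStoreSketch {
    static final int NUM_WORKERS = 5; // one shard per storage machine

    // Decide which of the 5 storage workers owns a given key (term or page id).
    static int workerFor(String key) {
        return Math.floorMod(key.hashCode(), NUM_WORKERS);
    }

    // On the owning worker, write the entry into a local Berkeley DB JE database.
    static void putLocal(String databaseDirectory, String key, String value) {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(new File(databaseDirectory), envCfg);
        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "invertedIndex", dbCfg); // name is illustrative
        db.put(null,
               new DatabaseEntry(key.getBytes(StandardCharsets.UTF_8)),
               new DatabaseEntry(value.getBytes(StandardCharsets.UTF_8)));
        db.close();
        env.close();
    }
}
```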
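At ranking time the retrieved postings and the precomputed PageRank scores have to be blended into a single ordering. The plain-Java sketch below shows one such blend (a weighted sum of a TF-IDF-style relevance score and PageRank) without the Spark plumbing; the weight `W` and the score maps are assumptions for illustration, not the formula used in /src.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RankingSketch {
    static final double W = 0.7; // assumed relevance weight, not the value used in /src

    // Order documents by W * tfidf + (1 - W) * pagerank, highest score first.
    static List<String> rank(Map<String, Double> tfidfByDoc, Map<String, Double> pagerankByDoc) {
        return tfidfByDoc.entrySet().stream()
            .sorted(Comparator.comparingDouble((Map.Entry<String, Double> e) ->
                        W * e.getValue()
                        + (1 - W) * pagerankByDoc.getOrDefault(e.getKey(), 0.0))
                    .reversed())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```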
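Fault tolerance at search time comes down to skipping storage workers that do not respond: the search engine queries each worker with a short timeout and simply continues with the shards that are reachable. The sketch below shows that pattern against a hypothetical HTTP `/lookup` endpoint; the port, path, and timeout values are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class FailoverLookupSketch {
    // Query every storage worker for a term; skip any worker that is down or slow.
    static List<String> lookupAll(List<String> workerIps, String term) {
        List<String> partials = new ArrayList<>();
        for (String ip : workerIps) {
            try {
                URL url = new URL("http://" + ip + ":8080/lookup?term=" + term); // hypothetical endpoint
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(2000); // illustrative timeouts
                conn.setReadTimeout(2000);
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()))) {
                    String shardResult = in.readLine();
                    if (shardResult != null) {
                        partials.add(shardResult);
                    }
                }
            } catch (IOException e) {
                // Worker unreachable: degrade gracefully and use the remaining shards.
            }
        }
        return partials;
    }
}
```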
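Autocompletion can be served by prefix lookup over the indexed terms. The sketch below uses a sorted `TreeSet` and its `tailSet` view for that lookup; the suggestion limit of 10 and the in-memory lexicon are assumptions about the UI wiring, not a description of the code in /src.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class AutocompleteSketch {
    private final TreeSet<String> terms = new TreeSet<>();

    public AutocompleteSketch(Iterable<String> lexicon) {
        lexicon.forEach(terms::add);
    }

    // Return up to 10 lexicon terms that start with the typed prefix.
    public List<String> suggest(String prefix) {
        List<String> suggestions = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix) || suggestions.size() == 10) {
                break;
            }
            suggestions.add(term);
        }
        return suggestions;
    }
}
```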
All source code is located in the /src folder.
cmd: java -jar SearchEngine.jar
cmd: java -jar SearchWorkerServer.jar $[list of worker IPs] $workerIndex $masterIP $databaseDirectory

