Vinit2244/Deduplication-and-Crawling


Project Report: Web Systems & Data Engineering

Directory Structure

.
├── code                                    # All the code for project
│   ├── crawling                            # Crawling classes and functions
│   │   ├── __init__.py
│   │   ├── crawler.py
│   │   └── node.py
│   ├── deduplication                       # Deduplication classes and functions
│   │   ├── __init__.py
│   │   ├── deduplicator.py
│   │   ├── feature_extractor.py
│   │   └── preprocessor.py
│   ├── shared                              # Utilities used throughout the codebase
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   └── utils.py
│   ├── crawl.py                            # Main entry point for crawling assignment
│   ├── deduplicate.py                      # Main entry point for deduplication assignment
│   ├── setup_server.sh                     # Loads the server's docker image
│   ├── start_server.sh                     # Starts the server in docker container
│   ├── stop_server.sh                      # Stops the running docker containers
│   └── test_crawling.sh                    # Tests crawling on given config
├── config.yaml                             # All the configurations for both the assignments
├── data                                    # Holds data for deduplication
│   ├── data.csv
│   └── deduplicated_data.csv
├── docker_images                           # Contains the web server image for simulation
│   └── server.tar
├── docs                                    # Assignment reports
│   ├── CRAWLING_REPORT.md
│   └── DEDUPLICATION_REPORT.md
├── logs                                    # Logs generated while running code
│   ├── crawling.log
│   └── deduplication.log
├── outputs                                 # All the plots and screenshots
│   ├── crawled_pages_graph.png
│   ├── dedup_sample_output.png
│   ├── metric_plot_....png
│   └── uv_gantt_chart_....png
├── README.md                               # This file
└── requirements.txt                        # Requirements for reproducing the project

Setup

# Create environment
python3 -m venv env

# Activate the environment
source ./env/bin/activate

# Install all the requirements
pip install -r requirements.txt

# Move into the code folder
cd code
1. Deduplication

    # Note: Please update config.yaml according to the data you need to process

    # Run the deduplication script
    python3 deduplicate.py

2. Crawling

    NOTE: Make sure the Docker daemon is running

    # Verifies the checksum of the Docker image and loads it (only needed once, at the start)
    sh setup_server.sh

    # Note: Please update config.yaml according to your preferences
    # Test the code
    sh test_crawling.sh

    To stop the server once you are done:

    sh stop_server.sh
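Both entry points read their settings from config.yaml at the repository root. The exact schema is defined by the repo itself; as a hedged sketch only, a file of this shape (all keys here are hypothetical, not the project's actual schema) might separate the two assignments like so:

```yaml
# Hypothetical sketch — consult the repository's config.yaml for the real keys
crawling:
  seed_url: "http://localhost:8080"   # dockerized test server
  max_depth: 3
deduplication:
  input_file: "../data/data.csv"
  output_file: "../data/deduplicated_data.csv"
```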

Part 1: Crawling

Refer to the Crawling Report (docs/CRAWLING_REPORT.md) for documentation and analysis.
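The project's crawler lives in code/crawling/ (crawler.py, node.py) and targets the dockerized test server; its actual design is covered in the report. Purely as a self-contained illustration of the breadth-first idea behind web crawling, assuming nothing about the project's classes, a toy crawler over an in-memory site map might look like this:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(pages, start):
    """BFS over an in-memory site map {url: html}; returns visit order."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(pages.get(url, ""))
        for link in parser.links:
            # Only follow links we know about and have not queued yet
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" standing in for the dockerized server
site = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/">home</a>',
}
```

A real crawler would fetch pages over HTTP and respect politeness constraints; the visited set and queue are the essential pieces either way.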


Part 2: Deduplication

Refer to the Deduplication Report (docs/DEDUPLICATION_REPORT.md) for documentation and analysis.
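The project's pipeline is split across code/deduplication/ (preprocessor.py, feature_extractor.py, deduplicator.py); the report describes the actual method used. As a rough, generic illustration only, and not the repository's implementation, an exact-duplicate filter over normalized records could be sketched as:

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so near-identical rows hash alike."""
    return " ".join(text.lower().split())

def deduplicate(rows):
    """Keep the first row for each normalized-content hash."""
    seen, kept = set(), []
    for row in rows:
        digest = hashlib.sha256(normalize(row).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept
```

Hashing the normalized form keeps memory bounded by the number of unique rows; fuzzier near-duplicate detection would need feature extraction and a similarity measure instead of exact hashes.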
