```
.
├── code                       # All the code for the project
│   ├── crawling               # Crawling classes and functions
│   │   ├── __init__.py
│   │   ├── crawler.py
│   │   └── node.py
│   ├── deduplication          # Deduplication classes and functions
│   │   ├── __init__.py
│   │   ├── deduplicator.py
│   │   ├── feature_extractor.py
│   │   └── preprocessor.py
│   ├── shared                 # Utilities used throughout the codebase
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   └── utils.py
│   ├── crawl.py               # Main entry point for the crawling assignment
│   ├── deduplicate.py         # Main entry point for the deduplication assignment
│   ├── setup_server.sh        # Loads the server's Docker image
│   ├── start_server.sh        # Starts the server in a Docker container
│   ├── stop_server.sh         # Stops the running Docker containers
│   └── test_crawling.sh       # Tests crawling on the given config
├── config.yaml                # All the configuration for both assignments
├── data                       # Holds data for deduplication
│   ├── data.csv
│   └── deduplicated_data.csv
├── docker_images              # Contains the web server image for simulation
│   └── server.tar
├── docs                       # Assignment reports
│   ├── CRAWLING_REPORT.md
│   └── DEDUPLICATION_REPORT.md
├── logs                       # Logs generated while running the code
│   ├── crawling.log
│   └── deduplication.log
├── outputs                    # All the plots and screenshots
│   ├── crawled_pages_graph.png
│   ├── dedup_sample_output.png
│   ├── metric_plot_....png
│   └── uv_gantt_chart_....png
├── README.md                  # This file
└── requirements.txt           # Requirements for reproducing the project
```
```bash
# Create environment
python3 -m venv env

# Activate the environment
source ./env/bin/activate

# Install all the requirements
pip install -r requirements.txt

# Move into the code folder
cd code
```
## Deduplication

```bash
# Note: Please update config.yaml for the data you need to process

# Run the deduplication script
python3 deduplicate.py
```
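The actual pipeline lives in `code/deduplication/` (`preprocessor.py`, `feature_extractor.py`, `deduplicator.py`). As a rough illustration of the idea only (the function names and the normalize-then-hash scheme below are illustrative, not the project's implementation), an exact-duplicate pass over a list of records can be sketched as:

```python
import hashlib

def normalize(text: str) -> str:
    # Illustrative preprocessing: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def deduplicate(records: list[str]) -> list[str]:
    # Keep the first record seen for each normalized-content hash.
    seen: set[str] = set()
    unique: list[str] = []
    for rec in records:
        key = hashlib.sha256(normalize(rec).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

The real `deduplicate.py` reads `data/data.csv` and writes `data/deduplicated_data.csv` as configured in `config.yaml`.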
## Crawling

NOTE: Make sure the Docker daemon is running.

```bash
# Verify the checksum of the Docker image and load it into Docker (only once, at the start)
sh setup_server.sh

# Note: Please update config.yaml according to your preferences

# Test the code
sh test_crawling.sh
```
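`setup_server.sh` checksums `docker_images/server.tar` before loading it with Docker. The loading step is Docker's own `docker load`; the checksum half can be sketched in Python (the file path and expected digest in any real check belong to the project, not this sketch):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so a large image never has to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned hex digest against a known-good value before `docker load -i docker_images/server.tar` guards against a corrupted download.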
To stop the server once you are done:

```bash
sh stop_server.sh
```
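The crawler itself (`code/crawling/crawler.py`, driven by `crawl.py`) fetches pages from the dockerized server. The traversal at its core is a standard breadth-first walk of the link graph, which can be sketched offline over a toy in-memory link map (the graph, start page, and depth limit below are made up for illustration):

```python
from collections import deque

def bfs_crawl(links: dict[str, list[str]], start: str, max_depth: int = 2) -> list[str]:
    # Breadth-first traversal with a visited set, the way a crawler frontier works.
    visited = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not expand pages at the depth limit.
        for nxt in links.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order
```

A real crawler replaces `links.get(page, [])` with an HTTP fetch and link extraction, and typically adds politeness delays and error handling.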
Refer to the [Crawling Report](docs/CRAWLING_REPORT.md) for documentation and analysis.

Refer to the [Deduplication Report](docs/DEDUPLICATION_REPORT.md) for documentation and analysis.