Vinit2244/Deduplication-and-Crawling


Project Report: Web Systems & Data Engineering

Directory Structure

.
├── code                                    # All the code for project
│   ├── crawling                            # Crawling classes and functions
│   │   ├── __init__.py
│   │   ├── crawler.py
│   │   └── node.py
│   ├── deduplication                       # Deduplication classes and functions
│   │   ├── __init__.py
│   │   ├── deduplicator.py
│   │   ├── feature_extractor.py
│   │   └── preprocessor.py
│   ├── shared                              # Utilities used throughout the codebase
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   └── utils.py
│   ├── crawl.py                            # Main entry point for crawling assignment
│   ├── deduplicate.py                      # Main entry point for deduplication assignment
│   ├── setup_server.sh                     # Loads the server's docker image
│   ├── start_server.sh                     # Starts the server in docker container
│   ├── stop_server.sh                      # Stops the running docker containers
│   └── test_crawling.sh                    # Tests crawling on given config
├── config.yaml                             # All the configurations for both the assignments
├── data                                    # Holds data for deduplication
│   ├── data.csv
│   └── deduplicated_data.csv
├── docker_images                           # Contains the web server image for simulation
│   └── server.tar
├── docs                                    # Assignment reports
│   ├── CRAWLING_REPORT.md
│   └── DEDUPLICATION_REPORT.md
├── logs                                    # Logs generated while running code
│   ├── crawling.log
│   └── deduplication.log
├── outputs                                 # All the plots and screenshots
│   ├── crawled_pages_graph.png
│   ├── dedup_sample_output.png
│   ├── metric_plot_....png
│   └── uv_gantt_chart_....png
├── README.md                               # This file
└── requirements.txt                        # Requirements for reproducing the project

Setup

# Create environment
python3 -m venv env

# Activate the environment
source ./env/bin/activate

# Install all the requirements
pip install -r requirements.txt

# Move into the code folder
cd code
1. Deduplication

    # Note: Please update config.yaml according to the data you need to process

    # Run the deduplication script
    python3 deduplicate.py

2. Crawling

    NOTE: Make sure the Docker daemon is running

    # Verifies the checksum of the Docker image and loads it (only needed once, at the start)
    sh setup_server.sh

    # Note: Please update config.yaml according to your preferences
    # Test the code
    sh test_crawling.sh

    To stop the server once you are done:

    sh stop_server.sh
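Both entry points read their settings from config.yaml at the repository root. The exact schema is defined by the repo itself; as a hedged sketch only, a file of this shape (all keys here are hypothetical, not the project's actual schema) might separate the two assignments like so:

```yaml
# Hypothetical sketch — consult the repository's config.yaml for the real keys
crawling:
  seed_url: "http://localhost:8080"   # dockerized test server
  max_depth: 3
deduplication:
  input_file: "../data/data.csv"
  output_file: "../data/deduplicated_data.csv"
```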

Part 1: Crawling

Refer to the Crawling Report (docs/CRAWLING_REPORT.md) for documentation and analysis.
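The project's crawler lives in code/crawling/ (crawler.py, node.py) and targets the dockerized test server; its actual design is covered in the report. Purely as a self-contained illustration of the breadth-first idea behind web crawling, assuming nothing about the project's classes, a toy crawler over an in-memory site map might look like this:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(pages, start):
    """BFS over an in-memory site map {url: html}; returns visit order."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(pages.get(url, ""))
        for link in parser.links:
            # Only follow links we know about and have not queued yet
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" standing in for the dockerized server
site = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/">home</a>',
}
```

A real crawler would fetch pages over HTTP and respect politeness constraints; the visited set and queue are the essential pieces either way.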


Part 2: Deduplication

Refer to the Deduplication Report (docs/DEDUPLICATION_REPORT.md) for documentation and analysis.
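The project's pipeline is split across code/deduplication/ (preprocessor.py, feature_extractor.py, deduplicator.py); the report describes the actual method used. As a rough, generic illustration only, and not the repository's implementation, an exact-duplicate filter over normalized records could be sketched as:

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so near-identical rows hash alike."""
    return " ".join(text.lower().split())

def deduplicate(rows):
    """Keep the first row for each normalized-content hash."""
    seen, kept = set(), []
    for row in rows:
        digest = hashlib.sha256(normalize(row).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept
```

Hashing the normalized form keeps memory bounded by the number of unique rows; fuzzier near-duplicate detection would need feature extraction and a similarity measure instead of exact hashes.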
