Web Crawler

  • Simple web crawler written in Go
  • Given a root URL, it parses the page for links and recursively crawls each linked site, up to a maximum depth
  • Returns a list of the sites found, with the frequency of each
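The recursive, depth-limited crawl described above might look roughly like this. This is a minimal sketch, not the project's actual code: the link graph is an in-memory stand-in for real HTTP fetching and HTML parsing, and the function names are illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// links stands in for the web: each page maps to the pages it references.
// A real crawler would fetch the page over HTTP and parse its <a href> tags.
var links = map[string][]string{
	"root.com": {"a.com", "b.com"},
	"a.com":    {"b.com", "c.com"},
	"b.com":    {"a.com"},
	"c.com":    {},
}

// crawl records a visit to url, then crawls its links in separate
// goroutines until the depth budget is exhausted.
func crawl(url string, depth int, freq map[string]int, mu *sync.Mutex, wg *sync.WaitGroup) {
	defer wg.Done()
	mu.Lock()
	freq[url]++
	mu.Unlock()
	if depth <= 0 {
		return
	}
	for _, next := range links[url] {
		wg.Add(1)
		go crawl(next, depth-1, freq, mu, wg)
	}
}

// runCrawl starts a crawl at start and waits for every goroutine to finish.
func runCrawl(start string, depth int) map[string]int {
	freq := make(map[string]int)
	var mu sync.Mutex
	var wg sync.WaitGroup
	wg.Add(1)
	crawl(start, depth, freq, &mu, &wg)
	wg.Wait()
	return freq
}

func main() {
	freq := runCrawl("root.com", 2)
	urls := make([]string, 0, len(freq))
	for u := range freq {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	for _, u := range urls {
		fmt.Printf("%s: %d\n", u, freq[u])
	}
}
```

The mutex guards the shared frequency map, and the WaitGroup is incremented before each goroutine launch so the main goroutine cannot return early.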

How to Use

  1. With Go installed, build with: go build main.go
  2. Start the program with: ./main
  3. Enter a URL to search, omitting the protocol (HTTP is assumed)
    • e.g. google.com, twitter.com
  4. Enter a maximum depth to search
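Because the prompt omits the protocol, the program presumably prefixes one before fetching. A sketch of that normalization, assuming a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeURL prepends "http://" when the user omits the scheme,
// matching the crawler's assumption that plain HTTP is used.
// The function name is illustrative, not the project's actual API.
func normalizeURL(input string) string {
	input = strings.TrimSpace(input)
	if strings.HasPrefix(input, "http://") || strings.HasPrefix(input, "https://") {
		return input
	}
	return "http://" + input
}

func main() {
	fmt.Println(normalizeURL("google.com"))    // http://google.com
	fmt.Println(normalizeURL("https://x.com")) // already has a scheme; kept as-is
}
```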

Notes

  • Final output sorts URLs by frequency
  • Input validation for URLs needs improvement, including checking against multiple input formats
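The frequency-sorted output could be produced with `sort.Slice`. A minimal sketch; the `pair` type and sample data are illustrative, not the project's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// pair associates a URL with how many times it was found.
type pair struct {
	url   string
	count int
}

// sortByFrequency flattens the frequency map into a slice and sorts it
// by descending count, breaking ties alphabetically.
func sortByFrequency(freq map[string]int) []pair {
	pairs := make([]pair, 0, len(freq))
	for u, c := range freq {
		pairs = append(pairs, pair{u, c})
	}
	sort.Slice(pairs, func(i, j int) bool {
		if pairs[i].count != pairs[j].count {
			return pairs[i].count > pairs[j].count
		}
		return pairs[i].url < pairs[j].url
	})
	return pairs
}

func main() {
	freq := map[string]int{"google.com": 3, "twitter.com": 1, "example.com": 3}
	for _, p := range sortByFrequency(freq) {
		fmt.Printf("%s: %d\n", p.url, p.count)
	}
}
```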

Extensions TODO

  • Show progress of URLs found -> Timer shows scrape duration
  • Utilize concurrency -> Done for web scraping
  • Get cookies
  • Enumerate all network calls made by a website
