This Go program is a simple web crawler that extracts URLs from a webpage up to a specified depth.
- Make sure you have Go installed on your machine. If not, you can download and install it from the official Go website (https://go.dev/dl/).
- Clone the repository: git clone https://github.com/akhiljns/go-crawler.git
- Change into the project directory: cd go-crawler
- Run the program: go run main.go -url https://www.example.com/ -depth 2 (replace the URL and depth parameters as needed; a sketch of how these flags could be parsed follows below).
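The -url and -depth values are ordinary command-line flags. A minimal sketch of how they could be wired up with Go's standard flag package is shown below; it is illustrative only and may not match the repository's actual main.go.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Command-line flags matching the invocation shown above; the exact
	// flag handling in the repository's main.go may differ from this sketch.
	url := flag.String("url", "https://www.example.com/", "root URL to start crawling from")
	depth := flag.Int("depth", 2, "maximum crawl depth")
	flag.Parse()

	fmt.Printf("crawling %s to depth %d\n", *url, *depth)
	// CrawlWebpage(*url, *depth) // entry point described in the sections below
}
```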
Branch Information:
- The master branch contains the concurrent code. To switch to the synchronous version, run: git checkout sync-crawler
Code Progression:
- To understand how the code progressed from synchronous to concurrent, please go through the commit history. The commits provide insight into the step-by-step development process (see https://github.com/akhiljns/go-crawler/pull/1/files).
Visited Map Usage:
- The code uses a visited map to keep track of visited URLs, preventing redundant processing (a sketch of this shared state follows below).
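As a sketch, that shared state could be a package-level map guarded by the mutex described further down (assumes the sync package is imported; the exact declarations in the repository may differ):

```go
// Shared crawler state: visited records every URL that has already been
// processed so it is never fetched or queued a second time.
var (
	visited      = make(map[string]bool)
	visitedMutex sync.Mutex // guards visited; see the mutex section below
)
```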
Initialization with CrawlWebpage:
- The CrawlWebpage function sets the initial depth for the root URL and initiates the crawling process (a sketch follows below).
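A sketch of what such an entry point could look like, using the visited map and a sync.WaitGroup to wait for every goroutine spawned during the crawl; the signature here is an assumption, not the repository's exact one:

```go
// CrawlWebpage seeds the visited map with the root URL, kicks off the
// recursive crawl at the requested depth, and blocks until all goroutines
// spawned along the way have finished.
func CrawlWebpage(rootURL string, depth int) {
	var wg sync.WaitGroup

	visitedMutex.Lock()
	visited[rootURL] = true
	visitedMutex.Unlock()

	wg.Add(1)
	go crawl(rootURL, depth, &wg)
	wg.Wait()
}
```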
Recursive Crawling with Crawl Function:
- The crawl function fetches HTML content, extracts links, and spawns goroutines for unvisited URLs (see the sketch after the next point).
Asynchronous Processing with Goroutines:
- Goroutines are used for asynchronous processing, enhancing performance by parallelizing URL crawling.
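Combining the two points above, a sketch of how crawl could recurse and fan out. fetchHTML and extractLinks are placeholders for whatever fetching and link-extraction helpers the repository actually uses, and markVisited is sketched in the mutex section below:

```go
// crawl fetches one page, extracts its links, and launches a goroutine for
// every link that has not been visited yet, stopping when depth runs out.
func crawl(url string, depth int, wg *sync.WaitGroup) {
	defer wg.Done()
	if depth <= 0 {
		return
	}

	body, err := fetchHTML(url) // placeholder: HTTP GET + read body
	if err != nil {
		return
	}

	for _, link := range extractLinks(body) { // placeholder: collect <a href> values
		if markVisited(link) { // true only the first time a URL is seen
			wg.Add(1)
			go crawl(link, depth-1, wg)
		}
	}
}
```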
Concurrency Safety with Mutex:
- To ensure data consistency, a mutex (visitedMutex) is employed to synchronize access to the visited map (a sketch follows below).
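A sketch of the check-and-mark step used in the crawl sketch above, wrapping the lock so the read and the write always happen together (the helper name markVisited is illustrative, not taken from the repository):

```go
// markVisited reports whether url is new and, if so, records it in the
// visited map. It returns true exactly once per URL, so callers can use
// the result to decide whether to spawn another goroutine.
func markVisited(url string) bool {
	visitedMutex.Lock()
	defer visitedMutex.Unlock()

	if visited[url] {
		return false
	}
	visited[url] = true
	return true
}
```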
Efficient and Safe Crawling:
- The combination of recursion, goroutines, and mutex provides an efficient and safe mechanism for web crawling in a concurrent environment.
Correctness:
- The program correctly crawls web pages and extracts URLs. Ensure you have a stable internet connection and correct URL input.
Simplicity:
- The code is straightforward, following a standard web crawling pattern using goroutines and a wait group.
Maintainability:
- The code is reasonably maintainable; functions are relatively short and focused on specific tasks.
Performance:
- The performance is acceptable for a simple web crawler. Consider adding rate limiting for web requests to be more polite towards the servers being crawled (a sketch of one approach follows below).
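One simple way to add such rate limiting with only the standard library is a shared ticker that every request waits on. The sketch below assumes a fetchHTML helper like the placeholder referenced earlier (imports: io, net/http, time); golang.org/x/time/rate would be a more flexible alternative:

```go
// limiter releases one token every 200ms, capping the crawler at roughly
// five requests per second across all goroutines.
var limiter = time.Tick(200 * time.Millisecond)

// fetchHTML waits for the rate limiter, then fetches the page body.
func fetchHTML(url string) (string, error) {
	<-limiter // block until the next tick before hitting the server

	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}
```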