This Go program is a simple web crawler that extracts URLs from a webpage up to a specified depth.
- Make sure you have Go installed on your machine. If not, you can download and install it from the official Go website (https://go.dev/dl/).
- Clone the repository: git clone https://github.com/akhiljns/go-crawler.git
- Change into the project directory: cd go-crawler
- Run the program: go run main.go -url https://www.example.com/ -depth 2 (replace the URL and depth parameters as needed; a sketch of how these flags could be parsed follows below).
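The -url and -depth values are ordinary command-line flags. A minimal sketch of how they could be wired up with Go's standard flag package is shown below; it is illustrative only and may not match the repository's actual main.go.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Command-line flags matching the invocation shown above; the exact
	// flag handling in the repository's main.go may differ from this sketch.
	url := flag.String("url", "https://www.example.com/", "root URL to start crawling from")
	depth := flag.Int("depth", 2, "maximum crawl depth")
	flag.Parse()

	fmt.Printf("crawling %s to depth %d\n", *url, *depth)
	// CrawlWebpage(*url, *depth) // entry point described in the sections below
}
```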
Branch Information:
- The master branch contains the concurrent code. To switch to the synchronous version, run: git checkout sync-crawler
Code Progression:
- To understand how the code progressed from synchronous to concurrent, please go through the commit history. The commits provide insight into the step-by-step development process (see https://github.com/akhiljns/go-crawler/pull/1/files).
Visited Map Usage:
- The code uses a visited map to keep track of visited URLs, preventing redundant processing (a sketch of this shared state follows below).
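As a sketch, that shared state could be a package-level map guarded by the mutex described further down (assumes the sync package is imported; the exact declarations in the repository may differ):

```go
// Shared crawler state: visited records every URL that has already been
// processed so it is never fetched or queued a second time.
var (
	visited      = make(map[string]bool)
	visitedMutex sync.Mutex // guards visited; see the mutex section below
)
```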
Initialization with CrawlWebpage:
- The CrawlWebpage function sets the initial depth for the root URL and initiates the crawling process (a sketch follows below).
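A sketch of what such an entry point could look like, using the visited map and a sync.WaitGroup to wait for every goroutine spawned during the crawl; the signature here is an assumption, not the repository's exact one:

```go
// CrawlWebpage seeds the visited map with the root URL, kicks off the
// recursive crawl at the requested depth, and blocks until all goroutines
// spawned along the way have finished.
func CrawlWebpage(rootURL string, depth int) {
	var wg sync.WaitGroup

	visitedMutex.Lock()
	visited[rootURL] = true
	visitedMutex.Unlock()

	wg.Add(1)
	go crawl(rootURL, depth, &wg)
	wg.Wait()
}
```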
Recursive Crawling with Crawl Function:
- The crawl function fetches HTML content, extracts links, and spawns goroutines for unvisited URLs (see the sketch after the next point).
Asynchronous Processing with Goroutines:
- Goroutines are used for asynchronous processing, enhancing performance by parallelizing URL crawling.
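Combining the two points above, a sketch of how crawl could recurse and fan out. fetchHTML and extractLinks are placeholders for whatever fetching and link-extraction helpers the repository actually uses, and markVisited is sketched in the mutex section below:

```go
// crawl fetches one page, extracts its links, and launches a goroutine for
// every link that has not been visited yet, stopping when depth runs out.
func crawl(url string, depth int, wg *sync.WaitGroup) {
	defer wg.Done()
	if depth <= 0 {
		return
	}

	body, err := fetchHTML(url) // placeholder: HTTP GET + read body
	if err != nil {
		return
	}

	for _, link := range extractLinks(body) { // placeholder: collect <a href> values
		if markVisited(link) { // true only the first time a URL is seen
			wg.Add(1)
			go crawl(link, depth-1, wg)
		}
	}
}
```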
Concurrency Safety with Mutex:
- To ensure data consistency, a mutex (visitedMutex) is employed to synchronize access to the visited map (a sketch follows below).
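A sketch of the check-and-mark step used in the crawl sketch above, wrapping the lock so the read and the write always happen together (the helper name markVisited is illustrative, not taken from the repository):

```go
// markVisited reports whether url is new and, if so, records it in the
// visited map. It returns true exactly once per URL, so callers can use
// the result to decide whether to spawn another goroutine.
func markVisited(url string) bool {
	visitedMutex.Lock()
	defer visitedMutex.Unlock()

	if visited[url] {
		return false
	}
	visited[url] = true
	return true
}
```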
Efficient and Safe Crawling:
- The combination of recursion, goroutines, and mutex provides an efficient and safe mechanism for web crawling in a concurrent environment.
Correctness:
- The program correctly crawls web pages and extracts URLs. Ensure you have a stable internet connection and correct URL input.
Simplicity:
- The code is straightforward, following a standard web crawling pattern using goroutines and a wait group.
Maintainability:
- The code is reasonably maintainable; functions are relatively short and focused on specific tasks.
Performance:
- The performance is acceptable for a simple web crawler. Consider adding rate limiting for web requests to be more polite towards the servers being crawled (a sketch of one approach follows below).
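One simple way to add such rate limiting with only the standard library is a shared ticker that every request waits on. The sketch below assumes a fetchHTML helper like the placeholder referenced earlier (imports: io, net/http, time); golang.org/x/time/rate would be a more flexible alternative:

```go
// limiter releases one token every 200ms, capping the crawler at roughly
// five requests per second across all goroutines.
var limiter = time.Tick(200 * time.Millisecond)

// fetchHTML waits for the rate limiter, then fetches the page body.
func fetchHTML(url string) (string, error) {
	<-limiter // block until the next tick before hitting the server

	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}
```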