9 changes: 5 additions & 4 deletions .gitignore
@@ -1,4 +1,5 @@
-cmd/tests
-makefile
-logs
-tmp
+bin/
+tests/
+.idea.md
+*.*prof
+.vscode/
6 changes: 6 additions & 0 deletions Makefile
@@ -0,0 +1,6 @@
clean:
start:
init:
build:
usage:
.PHONY: clean start init build usage
63 changes: 54 additions & 9 deletions README.md
@@ -1,22 +1,67 @@
-## WBot
+# WBot

A configurable, thread-safe web crawler that provides a minimal interface for crawling and downloading web pages.

-### Features:
+## Features

- Clean minimal API.
- Configurable: MaxDepth, MaxBodySize, Rate Limit, Parallelism, User Agent & Proxy rotation.
- Memory-efficient, thread-safe.
- Provides built-in interfaces: Fetcher, Store, Queue & a Logger.

-### [Examples & API](https://github.com/twiny/wbot/wiki)
-### TODO
-- [ ] Add support for robots.txt.
-- [ ] Add test cases.
-- [ ] Implement `Fetch` using Chromedp.
-- [ ] Add more examples.
-- [ ] Add documentation.
+## API
+
+WBot provides a minimal API for crawling web pages.
+
+```go
+Run(links ...string) error
+OnReponse(fn func(*wbot.Response))
+Metrics() map[string]int64
+Shutdown()
+```
+
+## Usage
+
+```go
+package main
+
+import (
+	"fmt"
+	"log"
+
+	"github.com/rs/zerolog"
+	"github.com/twiny/wbot"
+	"github.com/twiny/wbot/crawler"
+)
+
+func main() {
+	bot := crawler.New(
+		crawler.WithParallel(50),
+		crawler.WithMaxDepth(5),
+		crawler.WithRateLimit(&wbot.RateLimit{
+			Hostname: "*",
+			Rate:     "10/1s",
+		}),
+		crawler.WithLogLevel(zerolog.DebugLevel),
+	)
+	defer bot.Shutdown()
+
+	// read responses
+	bot.OnReponse(func(resp *wbot.Response) {
+		fmt.Printf("crawled: %s\n", resp.URL.String())
+	})
+
+	if err := bot.Run(
+		"https://crawler-test.com/",
+	); err != nil {
+		log.Fatal(err)
+	}
+
+	log.Printf("finished crawling\n")
+}
+```
+
+### Bugs
+
+Bugs or suggestions? Please visit the [issue tracker](https://github.com/twiny/wbot/issues).
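The `Rate: "10/1s"` value in the usage example above encodes a request budget as `<count>/<duration>`. As a minimal sketch of how such a string can be decoded with only the standard library; `parseRate` here is a hypothetical helper added for illustration, not wbot's actual parser:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRate splits a "<count>/<duration>" string such as "10/1s" into
// a request count and a time window. Illustrative only; wbot's real
// parsing logic is not shown in this diff.
func parseRate(s string) (int, time.Duration, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("rate %q: want <count>/<duration>", s)
	}
	n, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("rate %q: %w", s, err)
	}
	d, err := time.ParseDuration(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("rate %q: %w", s, err)
	}
	return n, d, nil
}

func main() {
	n, d, err := parseRate("10/1s")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d requests per %s\n", n, d) // 10 requests per 1s
}
```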
10 changes: 0 additions & 10 deletions config.go

This file was deleted.

66 changes: 66 additions & 0 deletions crawler/config.go
@@ -0,0 +1,66 @@
package crawler

import (
	"runtime"
	"time"

	"github.com/twiny/poxa"
)

const (
	defaultReferrer    = "https://www.google.com/search"
	defaultUserAgent   = "WBot/v0.2.0 (+https://github.com/twiny/wbot)"
	defaultTimeout     = 10 * time.Second
	defaultMaxBodySize = int64(1024 * 1024 * 5) // 5MB
)

type (
	config struct {
		parallel    int
		maxDepth    int32
		maxBodySize int64
		timeout     time.Duration
		userAgents  poxa.Spinner[string]
		referrers   poxa.Spinner[string]
		proxies     poxa.Spinner[string]
	}
)

func newConfig(maxDepth int32, userAgents, referrers, proxies []string) *config {
	if maxDepth <= 0 {
		maxDepth = 10
	}

	var conf = &config{
		parallel:    runtime.NumCPU(),
		maxDepth:    maxDepth,
		maxBodySize: defaultMaxBodySize,
		timeout:     defaultTimeout,
		userAgents:  poxa.NewSpinner(defaultUserAgent),
		referrers:   poxa.NewSpinner(defaultReferrer),
		proxies:     nil,
	}

	if len(userAgents) > 0 {
		uaList := poxa.NewSpinner(userAgents...)
		if uaList != nil {
			conf.userAgents = uaList
		}
	}

	if len(referrers) > 0 {
		refList := poxa.NewSpinner(referrers...)
		if refList != nil {
			conf.referrers = refList
		}
	}

	if len(proxies) > 0 {
		proxyList := poxa.NewSpinner(proxies...)
		if proxyList != nil {
			conf.proxies = proxyList
		}
	}

	return conf
}
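`newConfig` stores user agents, referrers, and proxies as `poxa.Spinner[string]` values so the crawler can rotate through them between requests. Only `poxa.NewSpinner` and the `Spinner[string]` type are visible in this diff, so the sketch below illustrates the underlying round-robin idea with a hypothetical `roundRobin` type and `Get` method; it is not poxa's actual API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin is an illustrative stand-in for poxa.Spinner: a minimal
// rotor that hands out its items in order, wrapping around at the end.
type roundRobin[T any] struct {
	items []T
	next  atomic.Uint64
}

// newRoundRobin mirrors the diff's convention of returning nil for an
// empty list (compare conf.proxies above).
func newRoundRobin[T any](items ...T) *roundRobin[T] {
	if len(items) == 0 {
		return nil
	}
	return &roundRobin[T]{items: items}
}

// Get returns the next item; the atomic counter keeps rotation safe
// when called from many goroutines at once.
func (r *roundRobin[T]) Get() T {
	i := r.next.Add(1) - 1
	return r.items[i%uint64(len(r.items))]
}

func main() {
	uas := newRoundRobin(
		"WBot/v0.2.0 (+https://github.com/twiny/wbot)",
		"Mozilla/5.0 (X11; Linux x86_64)",
	)
	for i := 0; i < 3; i++ {
		fmt.Println(uas.Get()) // alternates between the two agents
	}
}
```

Keeping the cursor in an atomic counter rather than behind a mutex makes rotation cheap to call from the crawler's parallel workers, which fits the thread-safety claim in the README.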