A configurable, thread-safe web crawler that provides a minimal interface for crawling and downloading web pages.
- Clean, minimal API.
- Configurable: MaxDepth, MaxBodySize, Rate Limit, Parallelism, User Agent & Proxy rotation.
- Memory-efficient, thread-safe.
- Provides built-in interfaces: Fetcher, Store, Queue & Logger.
```go
// Fetcher
type Fetcher interface {
	Fetch(req *Request) (*Response, error)
	Close() error
}

// Store
type Store interface {
	Visited(link string) bool
	Close() error
}

// Queue
type Queue interface {
	Add(req *Request)
	Pop() *Request
	Next() bool
	Close() error
}

// Logger
type Logger interface {
	Send(rep *Report)
	Close() error
}
```
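These interfaces describe the crawler's pluggable parts. As a rough sketch of what a custom Store could look like, here is a minimal in-memory implementation; the memStore type, the check-and-mark behavior of Visited, and the assumption that Store is exported from the root wbot package are illustrative only, and how such a store would be wired into WBot is not shown here.

```go
package memstore

import (
	"sync"

	"github.com/twiny/wbot"
)

// memStore is a minimal, thread-safe Store backed by a map of seen links.
type memStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// Compile-time check that memStore satisfies the Store interface
// (assuming Store is exported from the wbot package).
var _ wbot.Store = (*memStore)(nil)

func NewMemStore() *memStore {
	return &memStore{seen: make(map[string]struct{})}
}

// Visited reports whether link was already seen and records it as seen.
// This check-and-mark contract is an assumption, not documented behavior.
func (s *memStore) Visited(link string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[link]; ok {
		return true
	}
	s.seen[link] = struct{}{}
	return false
}

// Close releases the underlying map.
func (s *memStore) Close() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen = nil
	return nil
}
```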
```go
// NewWBot
func NewWBot(opts ...Option) (*WBot, error)

// Crawl
func (wb *WBot) Crawl(link string) error

// SetOptions
func (wb *WBot) SetOptions(opts ...Option)

// Stream
func (wb *WBot) Stream() <-chan *Response

// Close
func (wb *WBot) Close()
```

WBot requires Go 1.18.

```bash
go get github.com/twiny/wbot
```
```go
package main

import (
	"fmt"
	"time"

	"github.com/twiny/wbot"
)

func main() {
	// options
	opts := []wbot.Option{
		wbot.SetMaxDepth(5),
		wbot.SetParallel(10),
		wbot.SetRateLimit(1, 1*time.Second),
		wbot.SetMaxBodySize(1024 * 1024),
		wbot.SetUserAgents([]string{"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}),
	}

	// wbot - NewWBot returns an error, so check it
	bot, err := wbot.NewWBot(opts...)
	if err != nil {
		panic(err)
	}
	defer bot.Close()

	// crawl target
	site := `https://www.github.com`

	// stream responses as they arrive
	go func() {
		count := 0
		for resp := range bot.Stream() {
			count++
			fmt.Printf("num: %d - depth: %d - visited url: %s - status: %d - body len: %d\n", count, resp.Depth, resp.URL.String(), resp.Status, len(resp.Body))
		}
	}()

	if err := bot.Crawl(site); err != nil {
		panic(err)
	}

	fmt.Println("done")
}
```

- Add support for robots.txt.
- Add test cases.
- Implement `Fetch` using Chromedp.
- Add more examples.
- Add documentation.
Bugs or suggestions? Please visit the issue tracker.