Scrape Vietnam's national high school exam scores from any source of your choice.
Simply run:
go get github.com/012e/scrape-thpt
go install github.com/012e/scrape-thpt@latest
By default, it scrapes scores from Báo An Giang and saves them to the students table.
Basic usage:
$ scrape-thpt -help
  -con int
        Number of concurrent connections.
        Tweaks this number to scrape faster. (default 3)
  -end int
        End index, default value is start index
  -start int
        Start index
  -try int
        Total tries until give up scraping a candidate number (default 3)
All the scraping is mainly based on ScrapeSource interface:
type ScrapeSource interface {
	// GetRequest returns a request to the intended scrape source
	// for the candidate number sbd
	GetRequest(sbd int) (*http.Request, error)
	// ParseResponse parses the response received from the requested source
	ParseResponse(resp *http.Response) (*models.Student, error)
}
For examples, check out scrapesources, which already implements scrape sources for baoangiang.com.vn, vietnamnet.vn, and angiang.edu.vn.
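As an illustration of what implementing the interface can look like, here is a minimal, self-contained sketch. Everything in it is an assumption made for the example: the Student struct stands in for models.Student (whose real fields are not shown here), the interface is mirrored locally so the file compiles on its own, and ExampleSource, its URL layout, the JSON payload, and the fakeResponse helper are all hypothetical and do not describe any real source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Student is a stand-in for models.Student; these fields are assumptions.
type Student struct {
	ID   int     `json:"id"`
	Math float64 `json:"math"`
}

// ScrapeSource mirrors the interface above, using the local Student type
// so that this sketch compiles without the real package.
type ScrapeSource interface {
	GetRequest(sbd int) (*http.Request, error)
	ParseResponse(resp *http.Response) (*Student, error)
}

// ExampleSource is a hypothetical source that serves scores as JSON.
type ExampleSource struct {
	BaseURL string
}

// Compile-time check that ExampleSource satisfies the interface.
var _ ScrapeSource = ExampleSource{}

// GetRequest builds a GET request for one candidate number.
// The path layout and the 8-digit zero padding are assumptions.
func (s ExampleSource) GetRequest(sbd int) (*http.Request, error) {
	url := fmt.Sprintf("%s/scores/%08d", s.BaseURL, sbd)
	return http.NewRequest(http.MethodGet, url, nil)
}

// ParseResponse decodes the JSON body into a Student and rejects
// any non-200 status.
func (s ExampleSource) ParseResponse(resp *http.Response) (*Student, error) {
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var st Student
	if err := json.NewDecoder(resp.Body).Decode(&st); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	return &st, nil
}

// fakeResponse builds an in-memory response so the parser can be tried
// without touching the network.
func fakeResponse(status int, body string) *http.Response {
	return &http.Response{
		StatusCode: status,
		Status:     fmt.Sprintf("%d %s", status, http.StatusText(status)),
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}

func main() {
	src := ExampleSource{BaseURL: "https://example.invalid"}

	req, err := src.GetRequest(51000001)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)

	st, err := src.ParseResponse(fakeResponse(200, `{"id":51000001,"math":8.5}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *st)
}
```

Splitting request building from response parsing keeps each source free of networking, retry, and storage concerns, so a parser like this one can be exercised against canned responses.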
- To begin, create a new Scraper:
db := createGormDB()
scraper := scraper.NewScraper(scraper.Config{
	ConcurrentConnection: 3,               // three goroutines for scraping
	StartIndex:           51000001,        // scrapes candidate numbers in this range
	EndIndex:             51000010,
	Retries:              3,               // retries 3 times before giving up
	DB:                   db,              // any gorm instance
	Source:               baoag.Scraper{}, // anything that implements the `ScrapeSource` interface
})
Currently, gorm is the only supported ORM.
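To picture how the ConcurrentConnection and Retries settings interact, the following is a rough standalone sketch of a worker pool with per-candidate retries. It is not the package's actual implementation: runPool, scrapeError, errNotFound, and the fetch callback are hypothetical names, the range is assumed to be inclusive, and the retry count is treated as the total number of tries.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotFound is a hypothetical failure used by the demo in main.
var errNotFound = errors.New("not found")

// scrapeError pairs a candidate number with the error that made it fail.
type scrapeError struct {
	ID  int
	Err error
}

// runPool processes every ID in [start, end] (assumed inclusive) with
// `workers` goroutines. Each ID is passed to fetch up to `tries` times;
// if the last try still fails, the ID and its error are recorded.
func runPool(start, end, workers, tries int, fetch func(id int) error) []scrapeError {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker reads the channel
	}
	ids := make(chan int)
	var (
		mu     sync.Mutex
		failed []scrapeError
		wg     sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids {
				var err error
				for attempt := 0; attempt < tries; attempt++ {
					if err = fetch(id); err == nil {
						break
					}
				}
				if err != nil {
					mu.Lock()
					failed = append(failed, scrapeError{ID: id, Err: err})
					mu.Unlock()
				}
			}
		}()
	}
	for id := start; id <= end; id++ {
		ids <- id
	}
	close(ids)
	wg.Wait()
	return failed
}

func main() {
	// Hypothetical fetch: candidate numbers ending in 7 always fail.
	fetch := func(id int) error {
		if id%10 == 7 {
			return errNotFound
		}
		return nil
	}
	for _, e := range runPool(51000001, 51000010, 3, 3, fetch) {
		fmt.Printf("failed %d: %v\n", e.ID, e.Err)
	}
}
```

An unbuffered channel feeds the workers, so raising the worker count increases the number of requests in flight without changing the retry logic.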
- Start scraping, then handle the errors. GetErrors returns a slice of structs, each holding an error and the candidate number it belongs to:
scraper.Run()
errors := scraper.GetErrors()
for _, err := range errors {
	// Printf, not Println, so the format verbs are expanded
	fmt.Printf("failed %d: %v\n", err.ID, err.Err)
	// handle error
}
MIT license.