mioxin/gscrape

gscrape

gscrape is a utility for scraping data from HTML structures and saving it to an output file. gscrape runs in a multithreaded, concurrent mode; the number of workers running at the same time can be set with a command-line flag. The data to scrape from the HTML must be specified in the code of the org.go file. The OrgHtmlJson object, which implements the Scrape function of the Scraper interface, customizes the HTML data processing and the output data format.
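The worker model described above can be sketched as a small pool of goroutines fed from a channel. This is only an illustration: the `Scraper` signature shown here, the `echoScraper` type, and the `runWorkers` helper are assumptions made for the example, not the actual definitions in gscrape.

```go
package main

import (
	"fmt"
	"sync"
)

// Scraper is a hypothetical stand-in for the interface named in the README.
// The real method signature in gscrape may differ.
type Scraper interface {
	Scrape(url string) (string, error)
}

// echoScraper is a dummy implementation used only for this illustration.
type echoScraper struct{}

func (echoScraper) Scrape(url string) (string, error) {
	return "scraped:" + url, nil
}

// runWorkers processes urls with n concurrent workers (n should be at least 1)
// and returns the results in unspecified order. Failed URLs are skipped.
func runWorkers(s Scraper, urls []string, n int) []string {
	jobs := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				out, err := s.Scrape(u)
				if err != nil {
					continue
				}
				results <- out
			}
		}()
	}

	// Feed the jobs, then signal that no more work is coming.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var collected []string
	for r := range results {
		collected = append(collected, r)
	}
	return collected
}

func main() {
	out := runWorkers(echoScraper{}, []string{"a", "b", "c"}, 2)
	fmt.Println(len(out), "results")
}
```

Sharing one unbuffered job channel lets idle workers pick up the next URL as soon as they finish, so the worker count bounds the number of requests in flight.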

Usage:

gscrape <-h> <-t NNN> <-w NNN> <-o output_file> -i input_file <url>

Flags:

-h -help:  Show help. (Default: false)
-t:    Timeout in seconds for waiting for responses from web sites. (Default: 5)
-v -verbouse: Output the full log to StdOut. (Default: false; logging goes to the file gscrape.log in the local directory)
-w:    Number of workers running at the same time. (Default: 5)
-o:    File for the result output. If the flag is absent, output goes to StdOut.
-i:    Input file with the web sources to scrape. If the flag is absent, the input is taken from the last argument.

The list of URLs to process is defined in the input file or on the command line (a single URL). URL parameters can include masks of the following types:
[nnn:nnn] - a range of numbers, excluding the last number (Go slice style)
[word1;word2;word3] - an enumeration of strings

For example, the mask

html://www.site.com/path?chapter=[one;two]&page=[1:3]

expands to the URLs:

html://www.site.com/path?chapter=one&page=1
html://www.site.com/path?chapter=one&page=2
html://www.site.com/path?chapter=two&page=1
html://www.site.com/path?chapter=two&page=2
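The expansion shown above amounts to a Cartesian product over the masks, taken left to right. The following is a minimal sketch of that idea, not the code gscrape actually uses; `maskRe`, `maskValues`, and `expand` are hypothetical names, and the sketch assumes masks are never nested.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// maskRe matches one bracketed mask such as [1:3] or [one;two].
var maskRe = regexp.MustCompile(`\[([^\[\]]*)\]`)

// maskValues returns the values a mask body stands for.
// "1:3" is a half-open numeric range (1, 2); anything else is split on ";".
func maskValues(body string) []string {
	if lo, hi, ok := strings.Cut(body, ":"); ok {
		start, errLo := strconv.Atoi(lo)
		end, errHi := strconv.Atoi(hi)
		if errLo == nil && errHi == nil {
			vals := []string{}
			for n := start; n < end; n++ {
				vals = append(vals, strconv.Itoa(n))
			}
			return vals
		}
	}
	return strings.Split(body, ";")
}

// expand replaces every mask in pattern with each of its values,
// producing all combinations with the leftmost mask varying slowest.
func expand(pattern string) []string {
	loc := maskRe.FindStringSubmatchIndex(pattern)
	if loc == nil {
		return []string{pattern} // no masks left
	}
	prefix := pattern[:loc[0]]
	body := pattern[loc[2]:loc[3]]
	tails := expand(pattern[loc[1]:])

	var urls []string
	for _, v := range maskValues(body) {
		for _, tail := range tails {
			urls = append(urls, prefix+v+tail)
		}
	}
	return urls
}

func main() {
	for _, u := range expand("html://www.site.com/path?chapter=[one;two]&page=[1:3]") {
		fmt.Println(u)
	}
}
```

Run on the mask from the example, this sketch yields the same four URLs in the same order as listed above.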

A line in the input file can be commented out with "//".
