A Python tool to crawl a website for all linked URLs, similar to ScreamingFrog but without the bloat features.
It will crawl the website and follow links found within the HTML. A config file sets the allowed domains, whether to obey robots.txt, blacklisted URLs/snippets, and the output file.
There is no guarantee of finding all pages/posts/URLs on a website: if a page is not linked anywhere on the site, this bot will not find it (targeted marketing landing pages, for example).
There are no recursion limits, and CONCURRENT_REQUESTS is set to 32 by default.
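In Scrapy terms, those defaults would typically correspond to settings like the ones below. This is only a sketch of the equivalent Scrapy settings, not a copy of this repo's settings module.

```python
# Assumed Scrapy equivalents (illustrative, not copied from this repo).
DEPTH_LIMIT = 0           # 0 means no crawl-depth/recursion limit in Scrapy
CONCURRENT_REQUESTS = 32  # parallel requests, matching the default mentioned above
```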
- Python3
- venv :: `sudo apt install python3.8-venv`
- Scrapy :: handled in requirements.txt
Windows users are assumed to be in WSL
```
$ git clone git@github.com:techb/GenUrls.git
$ cd GenUrls
$ python3 -m venv venv
$ source venv/bin/activate
(venv)$ pip install -r requirements.txt
```
`./src/GenUrls/crawl_config.py` (an example sketch follows this list)
- DOMAINS :: allowed domains to crawl
- ENTRYURLS :: URLs where the spider starts
  - the sitemap and/or homepage is a good starting point
- DENY :: strings, URLs, and paths to blacklist
- LOG_LEVEL :: level of output in the terminal while actively crawling
- FEED_FORMAT :: results file format, `csv` or `json`
- FEED_URI :: path and file name of the results, defaults to the same dir with `RESULTS` as the name
- OBEY :: obey the robots.txt file; `False` will ignore it, `True` will obey
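A hedged example of what `crawl_config.py` might look like. The variable names come from the list above; the values are placeholders, not the repo's defaults.

```python
# Example crawl_config.py -- illustrative values only; adjust for your own site.
DOMAINS = ["example.com"]             # allowed domains to crawl
ENTRYURLS = ["https://example.com/"]  # sitemap and/or homepage are good entry points
DENY = ["logout", "/cart/"]           # strings/URLs/paths to blacklist
LOG_LEVEL = "INFO"                    # terminal verbosity while crawling
FEED_FORMAT = "csv"                   # "csv" or "json"
FEED_URI = "RESULTS.csv"              # output path; defaults to RESULTS in the same dir
OBEY = True                           # True obeys robots.txt, False ignores it
```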
```
(venv)$ cd src
(venv)$ vim GenUrls/crawl_config.py
```
- update the config file, save and exit
```
(venv)$ python run_spider.py
```
- Output will be in your chosen format (an example record is shown after the field list below)
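For reference, a runner script like `run_spider.py` is usually a thin wrapper around Scrapy's `CrawlerProcess`. The sketch below is an assumption about how such a script is typically written, not the repo's actual file, and the spider class name is hypothetical.

```python
# Minimal runner sketch (assumed; see run_spider.py for the real script).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from GenUrls.spiders.genurls_bot import GenurlsBotSpider  # hypothetical class name

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())  # picks up the project settings
    process.crawl(GenurlsBotSpider)
    process.start()  # blocks until the crawl finishes
```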
- start_url :: the found URL; in the case of a redirect, this is the URL that was redirected
- redirect :: the final redirected URL; null/None if no redirect happened
- status :: HTTP status code of the start_url
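As an illustration, a single result record (one object in the json output, one row in csv) might look like the following Python dict; the values are made up.

```python
# Illustrative result record -- values are invented for the example.
{
    "start_url": "https://example.com/old-page/",  # the URL that was found/requested
    "redirect": "https://example.com/new-page/",   # final URL after redirect, or None
    "status": 301,                                 # HTTP status code of start_url
}
```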
Besides the config, most development happens in `./src/GenUrls/spiders/genurls_bot.py`, which holds the driving class of the crawling spider (a structural sketch follows the attribute list).
- name :: the spider's name
- allowed_domains :: list of domains that the spider can crawl
  - `./src/GenUrls/crawl_config.py` :: DOMAINS
- start_urls :: list of entry-point URLs to start the crawl
  - `./src/GenUrls/crawl_config.py` :: ENTRYURLS
- handle_httpstatus_list :: list of HTTP status codes to watch for
- crawled_data :: list of dicts holding all crawled URLs and class data
- rules :: list of Scrapy Rules containing a LinkExtractor, callback, follow, and deny
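For orientation, here is a hedged sketch of how a Scrapy CrawlSpider with these attributes is commonly structured. The class name, status codes, config import, and callback logic are assumptions, not copied from the repo.

```python
# Structural sketch (assumed names; see genurls_bot.py for the real implementation).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from GenUrls import crawl_config  # hypothetical import of the config module described above


class GenurlsBotSpider(CrawlSpider):
    name = "genurls_bot"                           # spider's name
    allowed_domains = crawl_config.DOMAINS         # domains the spider may crawl
    start_urls = crawl_config.ENTRYURLS            # entry-point URLs for the crawl
    handle_httpstatus_list = [301, 302, 404, 500]  # status codes to watch for (example values)
    crawled_data = []                              # list of dicts of crawled URLs

    rules = (
        Rule(
            LinkExtractor(deny=crawl_config.DENY),  # skip blacklisted strings/paths
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Record the found URL, any redirect target, and the HTTP status.
        redirects = response.request.meta.get("redirect_urls", [])
        self.crawled_data.append({
            "start_url": redirects[0] if redirects else response.url,
            "redirect": response.url if redirects else None,
            "status": response.status,
        })
```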