WebCrawler — focused README (crawler.py)

This repository contains a small asynchronous web crawler. The main entrypoint and the focus of configuration is crawler.py — everything else (URL manager, simhash manager, state files) supports it.

This README gives a concise guide for running the crawler and the exact file/line locations of variables you can edit to tune the bot.

Highlights / features

  • Asynchronous crawling with configurable worker concurrency.
  • Global frontier implemented using URLManager (single global queue).
  • Robots.txt checks (configurable per instance).
  • Duplicate / near-duplicate detection via SimhashManager (see the sketch after this list).
  • Persistent state: queued/visited/failed/simhash stored under urls_data/ so crawls can resume.
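
For background, here is a minimal, runnable illustration of near-duplicate detection with the simhash package. It is a generic sketch, not the repository's SimhashManager; the word-level tokenization and the distance threshold k=3 are assumptions.

from simhash import Simhash, SimhashIndex

def features(text):
    # Crude tokenization: lowercase words. Real crawlers usually use shingles.
    return text.lower().split()

index = SimhashIndex([], k=3)            # k = max Hamming distance counted as "near duplicate"

def is_near_duplicate(page_id, html_text):
    # Call once per fetched page; True means skip the page as a near duplicate.
    fingerprint = Simhash(features(html_text))
    if index.get_near_dups(fingerprint):  # any stored fingerprint within distance k?
        return True
    index.add(page_id, fingerprint)       # remember this page for future comparisons
    return False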

Quick start

  1. Install requirements (see requirements.txt in the project):
python -m pip install -r requirements.txt
  2. Run the crawler from the repo root:
python crawler.py
  3. To seed new start URLs, edit the seeds = [] list (file and line shown below) or add them at runtime via URLManager.add_url(), as sketched below.
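
For the runtime option, something like the following should work. This is a sketch that assumes the URLManager class lives in URLManager.py and that add_url() takes a single URL string; the exact signature (and whether it is a coroutine) may differ in the actual code.

from URLManager import URLManager

url_manager = URLManager(disable_robots=True)   # same instantiation as in crawler.py
url_manager.add_url('https://example.com')      # assumes add_url() takes a single URL string
url_manager.save_state()                        # persist the new frontier under urls_data/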

Files of interest

  • crawler.py — main crawler. Edit this to change crawler runtime behavior.
  • URLManager.py — global queue URL manager and persistence.
  • SimhashManager.py — near-duplicate detection using simhash.
  • urls_data/ — persisted state and raw HTML pages.
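
Based on the defaults listed below, urls_data/ ends up looking roughly like this; the exact names of the queued/visited/failed/simhash state files are set by URLManager and SimhashManager and are not listed here.

urls_data/
├── raw/            # saved HTML pages (DATA_FOLDER)
├── url_map.json    # maps saved filenames back to original URLs (URL_MAP_FILE)
└── ...             # queued/visited/failed/simhash state files written by save_state()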

Where to change behavior (file + line numbers)

Below are the most useful variables and places to edit in crawler.py (paths and line numbers are based on the current repository code):

  • DATA_FOLDER (line 23)
    • Location where raw HTML files are saved. Default: "urls_data/raw".
  • URL_MAP_FILE (line 24)
    • JSON that maps saved filenames back to original URLs. Default: "urls_data/url_map.json".
  • NUM_WORKERS (line 28)
    • Number of concurrent crawl workers. Increase for more parallelism, but ensure your machine and network can handle it. Default: 100.
  • TARGET_PAGES (line 29)
    • Soft limit of pages to crawl during a run. When reached, workers stop. Default: 1000.
  • SAVE_INTERVAL (line 30)
    • Interval (seconds) at which simhash and URL manager state are saved. Default: 10 seconds.
  • executor = ThreadPoolExecutor(max_workers=NUM_WORKERS*3) (line 32)
    • Thread pool used to run blocking parsing tasks. Increase if parsing becomes a bottleneck.
  • save_queue = asyncio.Queue() (line 33)
    • Queue used by workers to hand pages to disk_writer() for writing to disk.
  • User-Agent list (lines 108–114)
    • The crawler picks a random User-Agent header from this list for each ClientSession. Edit or expand this list to vary headers.
  • url_manager = URLManager(disable_robots=True) (line 215)
    • Instantiation of the URL manager. Change disable_robots to False to enable robots.txt checks. You can also pass global_queue_maxsize to the constructor.
  • seeds = [] (line 225)
    • Seed URLs to start crawling if no pending URLs exist in state. Add your start URLs here.
  • crawl_tasks creation uses NUM_WORKERS (line 231)
    • Worker tasks are created with the number of workers above.

If the line numbers have drifted because the file has changed since this was written, search for the variable names above in crawler.py to find their current locations.
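
Taken together, the tunable block near the top of crawler.py looks roughly like the following. The values are the defaults listed above; the surrounding code is paraphrased, not copied.

from concurrent.futures import ThreadPoolExecutor
import asyncio

DATA_FOLDER = "urls_data/raw"              # where raw HTML is saved
URL_MAP_FILE = "urls_data/url_map.json"    # filename -> original URL map

NUM_WORKERS = 100                          # concurrent crawl workers
TARGET_PAGES = 1000                        # soft limit of pages per run
SAVE_INTERVAL = 10                         # seconds between state saves

executor = ThreadPoolExecutor(max_workers=NUM_WORKERS * 3)  # blocking parse work
save_queue = asyncio.Queue()               # pages waiting for disk_writer()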

Runtime behavior summary

  • Workers fetch URLs from URLManager.get_url() (the global queue).
  • Each worker performs basic filtering and processing (sketched below):
    • Skip non-HTML responses.
    • Respect canonical URLs (if canonical differs, add canonical and skip original).
    • Skip pages that metadata says are not English (heuristic) or that are near-duplicates per Simhash.
    • Save HTML to DATA_FOLDER and update url_map.
    • Extract links and add them back to the URLManager frontier.
  • SimhashManager prevents near-duplicate pages from being saved/processed.
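
The sketch below is a runnable, simplified stand-in for this loop using aiohttp and BeautifulSoup. It mirrors the non-HTML and canonical checks but omits robots.txt, the English-language heuristic, simhash de-duplication, and persistence; none of its helper names come from crawler.py.

import asyncio
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup

async def worker(frontier: asyncio.Queue, seen: set, limit: int = 50):
    # Simplified stand-in for crawler.py's worker: fetch, filter, extract links.
    async with aiohttp.ClientSession() as session:
        while len(seen) < limit:
            try:
                url = await asyncio.wait_for(frontier.get(), timeout=5)
            except asyncio.TimeoutError:
                break                                    # frontier drained, stop this worker
            if url in seen:
                continue
            seen.add(url)
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    if "text/html" not in resp.headers.get("Content-Type", ""):
                        continue                         # skip non-HTML responses
                    html = await resp.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                continue
            soup = BeautifulSoup(html, "html.parser")
            link = soup.find("link", attrs={"rel": "canonical"})
            canonical = urljoin(url, link["href"]) if link and link.get("href") else None
            if canonical and canonical != url:
                await frontier.put(canonical)            # queue the canonical URL, skip this one
                continue
            print("crawled", url)                        # crawler.py saves the HTML here instead
            for a in soup.find_all("a", href=True):
                await frontier.put(urljoin(url, a["href"]))   # grow the frontier

async def main():
    frontier, seen = asyncio.Queue(), set()
    await frontier.put("https://example.com")
    await asyncio.gather(*(worker(frontier, seen) for _ in range(5)))

asyncio.run(main())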

Persistence & resuming

  • URLManager.save_state() and SimhashManager.save_state() are called periodically and when soft limits are reached.
  • State files live in urls_data/ and are reloaded on start by load_state() routines.
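
For orientation, the periodic saving described above boils down to something like this; it is a sketch against the repository's manager classes, not the actual loop in crawler.py.

import asyncio

async def periodic_saver(url_manager, simhash_manager):
    # Flush frontier and simhash state every SAVE_INTERVAL seconds so a crash
    # loses at most one interval's worth of progress.
    while True:
        await asyncio.sleep(SAVE_INTERVAL)
        url_manager.save_state()
        simhash_manager.save_state()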

Requirements

See requirements.txt for the exact packages used. The basic list includes:

  • aiohttp
  • beautifulsoup4
  • simhash
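
If requirements.txt is missing from your checkout, a minimal, unpinned version covering the list above would be:

aiohttp
beautifulsoup4
simhash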

Example quick edits

  • Reduce worker count to 10 (edit line 28):
NUM_WORKERS = 10
  • Add seed URLs (edit line 225):
seeds = [
    'https://example.com',
    'https://example.org',
]

Safety and etiquette

  • Enabling robots.txt checks is recommended (see the one-line edit below): respect robots.txt and crawl politely, with modest concurrency and sensible delays. Only use disable_robots=True in controlled environments or with the site owner's permission.
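
For example, a one-line quick edit for enabling robots.txt checks at the url_manager instantiation listed above (around line 215 of crawler.py):

url_manager = URLManager(disable_robots=False)   # enable robots.txt checks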
