A high-performance, multi-threaded scraping engine designed specifically to discover and harvest job data from Avature-hosted career portals. This system features active site discovery, robust rate-limiting handling, and a persistent queue-based architecture.
- Active Discovery: Automatically finds new Avature career sites using Google/Bing dorking, subdomains, and Certificate Transparency logs (crt.sh).
- Smart Scraping: Uses multiple strategies to extract job links (API interception, RSS feeds, Sitemaps, HTML fallback).
- Resilience: Intelligent handling of HTTP 429/406 rate limits with "Hot Domain" tracking and exponential backoff.
- Scalable Architecture: Worker-thread model managed by a central SQLite database queue.
- Validation: Pre-verifies seed URLs to ensure only valid targets enter the pipeline.
- Real-time Dashboard: Rich terminal UI (TUI) showing live progress, speed, and active threads.
The system operates on a Manager-Worker model orchestrated by main.py and backed by a SQLite database (avature_scraper.db).
- The Database: Acts as the central brain. It stores:
  - Sites: Known Avature domains.
  - Job Queue: State machine for URLs (`PENDING` -> `PROCESSING` -> `COMPLETED`).
  - Jobs: Extracted data.
- Manager: The `main.py` process initializes the session, launches threads, and runs the TUI.
- Workers:
  - `DiscoveryWorker`: Scans known sites for job listings.
  - `DorkingDiscoveryWorker`: Searches the web for new Avature sites.
  - `ScraperWorker`: Processes specific job pages to extract details.
  - `SeedLoader`: Ingests and validates external URL lists.
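The Manager/Worker split above can be sketched with Python's threading primitives. This is a toy model, not the project's code: an in-process queue stands in for the SQLite-backed job queue, and the worker function mirrors only the shape of a `ScraperWorker`.

```python
import queue
import threading

# Shared work queue stands in for the SQLite job queue.
tasks = queue.Queue()
results = []
lock = threading.Lock()

def scraper_worker():
    """Toy worker: pull URLs until the manager sends a shutdown sentinel."""
    while True:
        url = tasks.get()
        if url is None:  # sentinel: manager signals graceful shutdown
            break
        with lock:
            results.append(f"scraped:{url}")
        tasks.task_done()

# "Manager": launch the pool, enqueue work, then shut down gracefully.
workers = [threading.Thread(target=scraper_worker) for _ in range(3)]
for w in workers:
    w.start()
for u in ("a", "b", "c", "d"):
    tasks.put(u)
tasks.join()                # wait until all queued work is done
for _ in workers:
    tasks.put(None)         # one sentinel per worker
for w in workers:
    w.join()
```

The real system replaces the in-memory queue with database rows so that progress survives restarts.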
To start the system in default mode, run:

```bash
python main.py
```

It will look for the database, create a new session, and start discovering sites automatically.

If you have a large list of potential URLs (e.g., `Urls.txt`), the system can verify and ingest them:

```bash
python main.py --seed Urls.txt --workers 10
```

- Note: The system runs a `SeedLoader` thread that pre-verifies links (checking for 200 OK) before adding them to the scrape queue, to save resources.
Adjust the number of threads based on your machine's capability and network bandwidth.
```bash
python main.py --workers 20 --discovery-workers 5
```

- `--session <ID>`: Resume a specific session.
- `--reprocess`: Reset all `COMPLETED` and `FAILED` jobs to `PENDING` to re-scrape them.
- `--reset-stale`: Unlock jobs that were stuck in `PROCESSING` status due to a previous crash.
- `--rediscover`: Force the system to re-scan all known sites for new links.
The entry point. It sets up the GlobalStats object (for the TUI), initializes the Database, and spawns the thread pool. It handles the graceful shutdown and orchestrates the different worker types.
The core intelligence of the project.
`AvatureScraper` Class:
- Discovery: It attempts to reverse-engineer the Avature "Instant Search" API to get JSON results directly. If that fails, it falls back to parsing `sitemap.xml`, RSS feeds, and finally HTML scraping.
- Extraction: Parses job pages to extract `title`, `description`, `location`, `posted_date`, and dynamic fields. It uses `BeautifulSoup` and supports extracting metadata from URL slugs and JSON-LD tags.
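The JSON-LD path can be sketched as follows. The real extractor uses `BeautifulSoup`; this dependency-free sketch uses plain string handling, and the sample page and helper name are illustrative.

```python
import json

# Illustrative page containing a schema.org JobPosting in JSON-LD.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "JobPosting",
 "title": "Software Engineer",
 "datePosted": "2025-01-15",
 "jobLocation": {"address": {"addressLocality": "Mountain View, CA"}}}
</script>
</head></html>
"""

def extract_jsonld(page: str) -> dict:
    """Pull the first application/ld+json block out of a page and parse it."""
    marker = 'type="application/ld+json">'
    start = page.index(marker) + len(marker)
    end = page.index("</script>", start)
    return json.loads(page[start:end])

job = extract_jsonld(html)
title = job["title"]
location = job["jobLocation"]["address"]["addressLocality"]
```

Structured JSON-LD is preferred when present because it avoids the brittleness of scraping rendered HTML.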
`HotDomainTracker`: A singleton that tracks domains returning HTTP 406/429. It blocks all workers from accessing a "hot" domain for a cooldown period (starting at 5 minutes), helping prevent IP bans.
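A minimal sketch of this idea, assuming a doubling backoff on repeated rate limits (method names and exact timings here are illustrative, not the project's actual code):

```python
import threading
import time

class HotDomainTracker:
    """Marks domains 'hot' after 406/429 responses and blocks access
    until an exponentially growing cooldown expires."""

    BASE_COOLDOWN = 300  # assumed: 5 minutes, doubling per repeat offence

    def __init__(self):
        self._lock = threading.Lock()
        self._until = {}    # domain -> timestamp when cooldown ends
        self._strikes = {}  # domain -> consecutive rate-limit count

    def mark_hot(self, domain, now=None):
        now = time.time() if now is None else now
        with self._lock:
            strikes = self._strikes.get(domain, 0) + 1
            self._strikes[domain] = strikes
            self._until[domain] = now + self.BASE_COOLDOWN * 2 ** (strikes - 1)

    def is_hot(self, domain, now=None):
        now = time.time() if now is None else now
        with self._lock:
            return self._until.get(domain, 0) > now

tracker = HotDomainTracker()
tracker.mark_hot("busy.avature.net", now=0)         # first strike: 5-minute cooldown
hot_now = tracker.is_hot("busy.avature.net", now=60)
cooled = tracker.is_hot("busy.avature.net", now=400)
```

Because workers consult one shared tracker before each request, a single 429 is enough to pause the whole pool for that domain.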
Responsible for finding new Avature portals.
- Methods:
  - CRT.sh: Queries public SSL certificate logs for `%.avature.net`.
  - Dorking: Runs search queries (e.g., `site:avature.net/careers`) on Bing, DuckDuckGo, and optionally Google.
  - Candidates: Checks a built-in list of Fortune 500 companies.
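The crt.sh step can be sketched like this. crt.sh does expose a JSON endpoint; the sample response below is fabricated for illustration, and the live fetch is deliberately left out so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Build the crt.sh query URL for wildcard Avature certificates.
query_url = "https://crt.sh/?" + urlencode({"q": "%.avature.net", "output": "json"})

# Abridged, illustrative example of what crt.sh returns:
sample_response = json.dumps([
    {"name_value": "careers.acme.avature.net"},
    {"name_value": "jobs.example.avature.net\ncareers.acme.avature.net"},
])

def extract_domains(body: str) -> set:
    """crt.sh packs multiple SANs into one name_value, newline-separated,
    so each entry has to be split and deduplicated."""
    domains = set()
    for entry in json.loads(body):
        domains.update(entry["name_value"].splitlines())
    return domains

domains = extract_domains(sample_response)
```

Certificate Transparency logs are a reliable discovery source because every `*.avature.net` subdomain with a public TLS certificate appears in them.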
A robust SQLite wrapper using WAL (Write-Ahead Logging) mode for high concurrency.
- Queue Logic: Uses atomic transactions (`BEGIN IMMEDIATE`) to claim batches of jobs for workers, ensuring no two workers process the same link.
- Storage: Uses an EAV (Entity-Attribute-Value) schema (the `job_attributes` table) to store flexible job metadata without needing schema migrations for every new field.
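The batch-claim pattern can be sketched as below. Table and column names are illustrative, not the actual schema of `avature_scraper.db` (which also enables WAL via `PRAGMA journal_mode=WAL`; an in-memory database here keeps the sketch self-contained).

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute(
    "CREATE TABLE job_queue (id INTEGER PRIMARY KEY, url TEXT,"
    " status TEXT DEFAULT 'PENDING')"
)
conn.executemany(
    "INSERT INTO job_queue (url) VALUES (?)",
    [(f"https://example.avature.net/job/{i}",) for i in range(5)],
)

def claim_batch(conn, limit=2):
    """Grab up to `limit` PENDING jobs. BEGIN IMMEDIATE takes the write
    lock up front, so two workers can never claim the same rows."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        rows = conn.execute(
            "SELECT id FROM job_queue WHERE status = 'PENDING'"
            " ORDER BY id LIMIT ?", (limit,)
        ).fetchall()
        ids = [r[0] for r in rows]
        conn.executemany(
            "UPDATE job_queue SET status = 'PROCESSING' WHERE id = ?",
            [(i,) for i in ids],
        )
        conn.execute("COMMIT")
        return ids
    except Exception:
        conn.execute("ROLLBACK")
        raise

batch_a = claim_batch(conn)  # as if worker 1 called it
batch_b = claim_batch(conn)  # as if worker 2 called it
```

Selecting and updating inside one immediate transaction is what makes the claim atomic; a plain deferred transaction would leave a window for double-claiming.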
A high-speed, multi-threaded link checker. It uses HEAD/GET requests to validate URLs before they enter the database. It filters out 404s and soft-404s early in the pipeline.
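The validation pass can be sketched as a thread pool running HEAD probes. The probe is injectable so the sketch stays testable offline; function names are illustrative, not the checker's real API.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def head_status(url, timeout=10):
    """Live probe: issue a HEAD request and return the HTTP status code."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None  # DNS failure, timeout, 4xx/5xx raised as HTTPError, etc.

def filter_valid(urls, probe=head_status, workers=10):
    """Keep only URLs whose probe returns 200, checking concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = pool.map(probe, urls)
    return [u for u, s in zip(urls, statuses) if s == 200]

# Offline demo with a stub probe instead of live HEAD requests:
fake = {"https://ok.example": 200, "https://gone.example": 404}.get
valid = filter_valid(["https://ok.example", "https://gone.example"], probe=fake)
```

A real checker would also need soft-404 heuristics (e.g., a 200 page whose body says "job not found"), which a status code alone cannot catch.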
The tool for getting data out of the system.
Usage:
```bash
# Export everything from the most recent session
python export_json.py

# Export a specific session
python export_json.py --session Session_123abc

# Export only NEW jobs that haven't been exported before
python export_json.py --new-only

# Export to a custom file
python export_json.py -o my_data.json
```

Features:
- Standardizes fields (maps `job_location`, `primary_location` -> `location`).
- Marks records as exported in the DB to support incremental dumps.
- Filters out "Unknown" or empty values.
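The normalization pass can be sketched as a single dictionary transform. The alias tuple and helper name are illustrative; the real logic lives in `export_json.py`.

```python
# Assumed alias map: fields folded into the canonical "location" key.
LOCATION_ALIASES = ("job_location", "primary_location")
DROP_VALUES = {"", "Unknown", None}

def standardize(record: dict) -> dict:
    """Map location aliases onto one field and drop empty/'Unknown' values."""
    out = {}
    for key, value in record.items():
        if value in DROP_VALUES:
            continue
        if key in LOCATION_ALIASES:
            key = "location"
        out.setdefault(key, value)  # first non-empty alias wins
    return out

raw = {
    "title": "Software Engineer",
    "job_location": "Mountain View, CA",
    "work_type": "Unknown",
    "department": "",
}
clean = standardize(raw)
```

Dropping junk values at export time keeps the EAV storage permissive while the JSON output stays clean.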
The exported JSON follows this structure:
```json
[
  {
    "title": "Software Engineer",
    "url": "https://careers.google.com/jobs/...",
    "description": "Full job description text...",
    "location": "Mountain View, CA",
    "work_type": "Hybrid",
    "posted_date": "2025-01-15",
    "source_site": "https://careers.google.com",
    "custom_field_1": "value",
    "custom_field_2": "value"
  }
]
```

- "Connection pool is full":
  - This is a warning from `urllib3` that appears when the thread count exceeds the connection pool size. The system handles it gracefully, but you can reduce `--workers` if it spams the logs.
- Rate Limiting (406 Errors):
- The system automatically detects this. You will see "Domain marked HOT" in the logs. The scraper will back off from that specific domain for 5-30 minutes. Do not restart the program; let it run, and it will retry later.
- No Jobs Found:
  - Ensure the `discovery` workers are running.
  - Check `logs/` for detailed error messages.
  - Verify your IP hasn't been blocked by search engines (for dorking).