Bitwise-01/Avature
Avature Scraper System

A high-performance, multi-threaded scraping engine designed specifically to discover and harvest job data from Avature-hosted career portals. This system features active site discovery, robust rate-limiting handling, and a persistent queue-based architecture.

Key Features

  • Active Discovery: Automatically finds new Avature career sites using Google/Bing dorking, subdomains, and Certificate Transparency logs (crt.sh).
  • Smart Scraping: Uses multiple strategies to extract job links (API interception, RSS feeds, Sitemaps, HTML fallback).
  • Resilience: Intelligent handling of HTTP 429/406 rate limits with "Hot Domain" tracking and exponential backoff.
  • Scalable Architecture: Worker-thread model managed by a central SQLite database queue.
  • Validation: Pre-verifies seed URLs to ensure only valid targets enter the pipeline.
  • Real-time Dashboard: Rich terminal UI (TUI) showing live progress, speed, and active threads.

Architecture Overview

The system operates on a Manager-Worker model orchestrated by main.py and backed by a SQLite database (avature_scraper.db).

  1. The Database: Acts as the central brain. It stores:
    • Sites: Known Avature domains.
    • Job Queue: State machine for URLs (PENDING -> PROCESSING -> COMPLETED).
    • Jobs: Extracted data.
  2. Manager: The main.py process initializes the session, launches threads, and runs the TUI.
  3. Workers:
    • DiscoveryWorker: Scans known sites for job listings.
    • DorkingDiscoveryWorker: Searches the web for new Avature sites.
    • ScraperWorker: Processes specific job pages to extract details.
    • SeedLoader: Ingests and validates external URL lists.
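The queue's state machine (PENDING -> PROCESSING -> COMPLETED) can be sketched as below. This is a minimal illustration, not the project's actual schema; the table and column names are assumptions.

```python
import sqlite3

# Minimal sketch of the job-queue state machine described above.
# Table and column names are illustrative, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_queue (
    url TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'PENDING'  -- PENDING -> PROCESSING -> COMPLETED
)""")
conn.execute("INSERT INTO job_queue (url) VALUES (?)",
             ("https://example.avature.net/careers/JobDetail/123",))

def claim_one(conn):
    """Move one PENDING row to PROCESSING and return its URL."""
    row = conn.execute(
        "SELECT url FROM job_queue WHERE status = 'PENDING' LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE job_queue SET status = 'PROCESSING' WHERE url = ?", row)
    return row[0]

def complete(conn, url):
    conn.execute("UPDATE job_queue SET status = 'COMPLETED' WHERE url = ?", (url,))

url = claim_one(conn)
complete(conn, url)
```

Because the status lives in SQLite rather than in memory, a crashed worker leaves its rows in PROCESSING, which is exactly what the --reset-stale recovery option cleans up.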

Usage

1. Basic Run (Auto-Discovery)

Start the system in its default mode. It looks for an existing database, creates a new session, and begins discovering sites automatically.

python main.py

2. Importing a Seed List

If you have a large list of potential URLs (e.g., Urls.txt), the system can verify and ingest them.

python main.py --seed Urls.txt --workers 10
  • Note: The system runs a SeedLoader thread that pre-verifies links (checking for 200 OK) before adding them to the scrape queue to save resources.

3. Scaling Workers

Adjust the number of threads based on your machine's capability and network bandwidth.

python main.py --workers 20 --discovery-workers 5

4. Recovery Options

  • --session <ID>: Resume a specific session.
  • --reprocess: Reset all COMPLETED and FAILED jobs to PENDING to re-scrape them.
  • --reset-stale: Unlock jobs that were stuck in PROCESSING status due to a previous crash.
  • --rediscover: Force the system to re-scan all known sites for new links.

Component Deep Dive

main.py

The entry point. It sets up the GlobalStats object (for the TUI), initializes the Database, and spawns the thread pool. It handles the graceful shutdown and orchestrates the different worker types.

scraper.py

The core intelligence of the project.

  • AvatureScraper Class:
    • Discovery: It attempts to reverse-engineer the Avature "Instant Search" API to get JSON results directly. If that fails, it falls back to parsing sitemap.xml, RSS feeds, and finally HTML scraping.
    • Extraction: Parses job pages to extract title, description, location, posted_date, and dynamic fields. It uses BeautifulSoup and supports extracting metadata from URL slugs and JSON-LD tags.
  • HotDomainTracker: A sophisticated singleton that tracks domains returning HTTP 406/429. It blocks all workers from accessing a "hot" domain for a cooldown period (starting at 5 minutes), automatically preventing IP bans.
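The cooldown behavior of HotDomainTracker might look roughly like the sketch below; the class name matches the source, but the internals (strike counting, doubling schedule) are assumptions based on the description above.

```python
import threading
import time

class HotDomainTracker:
    """Illustrative sketch of the cooldown logic described above.
    The real implementation's internals may differ."""
    BASE_COOLDOWN = 300  # 5 minutes, per the description above

    def __init__(self):
        self._lock = threading.Lock()
        self._cooldown_until = {}   # domain -> unix timestamp
        self._strikes = {}          # domain -> consecutive 406/429 count

    def mark_hot(self, domain, now=None):
        """Record a 406/429 and extend the cooldown exponentially."""
        now = time.time() if now is None else now
        with self._lock:
            strikes = self._strikes.get(domain, 0) + 1
            self._strikes[domain] = strikes
            # Exponential backoff: 5 min, 10 min, 20 min, ...
            self._cooldown_until[domain] = now + self.BASE_COOLDOWN * 2 ** (strikes - 1)

    def is_hot(self, domain, now=None):
        """Workers call this before every request and skip hot domains."""
        now = time.time() if now is None else now
        with self._lock:
            return now < self._cooldown_until.get(domain, 0)
```

A shared instance of such a tracker lets every worker thread back off from a rate-limited domain at once, instead of each thread discovering the 429s independently.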

discovery.py & active_discovery.py

Responsible for finding new Avature portals.

  • Methods:
    • CRT.sh: Queries public SSL certificate logs for %.avature.net.
    • Dorking: Runs search queries (e.g., site:avature.net/careers) on Bing, DuckDuckGo, and optionally Google.
    • Candidates: Checks a built-in list of Fortune 500 companies.
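The crt.sh lookup can be sketched as follows. crt.sh does expose a JSON output mode (output=json), but the parsing helper here is an assumption about how the project filters certificate names, not the actual code.

```python
import json
import urllib.request

# crt.sh JSON endpoint; %25 is a URL-encoded SQL wildcard (%).
CRT_SH_URL = "https://crt.sh/?q=%25.avature.net&output=json"

def extract_domains(entries):
    """Pull unique *.avature.net hostnames out of crt.sh JSON entries.
    Each entry's name_value may hold several newline-separated names."""
    domains = set()
    for entry in entries:
        for name in entry.get("name_value", "").splitlines():
            name = name.lstrip("*.").lower()  # drop wildcard prefixes
            if name.endswith(".avature.net"):
                domains.add(name)
    return sorted(domains)

def query_crt_sh():
    # Network call; crt.sh can be slow or rate-limited, so real code
    # should add retries on top of this timeout.
    with urllib.request.urlopen(CRT_SH_URL, timeout=30) as resp:
        return extract_domains(json.load(resp))
```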

database.py

A robust SQLite wrapper using WAL (Write-Ahead Logging) mode for high concurrency.

  • Queue Logic: Uses atomic transactions (BEGIN IMMEDIATE) to claim batches of jobs for workers, ensuring no two workers process the same link.
  • Storage: Uses an EAV (Entity-Attribute-Value) schema (job_attributes table) to store flexible job metadata without needing schema migrations for every new field.
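The atomic claim could look like the sketch below: BEGIN IMMEDIATE takes SQLite's write lock before the SELECT, so two workers can never claim the same rows. The schema and column names are illustrative.

```python
import sqlite3

def claim_batch(db_path, worker_id, batch_size=5):
    """Sketch of the atomic batch claim described above.
    Schema and column names are assumptions, not the real ones."""
    # isolation_level=None puts sqlite3 in autocommit mode so we can
    # manage the transaction explicitly with BEGIN IMMEDIATE.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
        rows = conn.execute(
            "SELECT id FROM job_queue WHERE status = 'PENDING' LIMIT ?",
            (batch_size,)).fetchall()
        ids = [r[0] for r in rows]
        if ids:
            marks = ",".join("?" * len(ids))
            conn.execute(
                f"UPDATE job_queue SET status = 'PROCESSING', worker = ? "
                f"WHERE id IN ({marks})", [worker_id, *ids])
        conn.execute("COMMIT")
        return ids
    except Exception:
        conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()
```

WAL mode keeps readers unblocked while one writer holds this lock, which is why the TUI can read progress counters while workers are claiming jobs.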

verifier.py

A high-speed, multi-threaded link checker. It uses HEAD/GET requests to validate URLs before they enter the database. It filters out 404s and soft-404s early in the pipeline.
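A simplified version of that checker might look like this; the HEAD-then-GET fallback and the thread count are assumptions, and the check function is injectable so the filtering logic can be exercised without the network.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request

def check_url(url, timeout=10):
    """Return the HTTP status for url, preferring a cheap HEAD request.
    Some servers reject HEAD, so fall back to GET on an error status."""
    for method in ("HEAD", "GET"):
        req = urllib.request.Request(url, method=method)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            if method == "GET":
                return e.code  # HEAD failed too; report the GET status
        except urllib.error.URLError:
            return None  # DNS failure, timeout, refused connection, ...
    return None

def verify_urls(urls, check=check_url, workers=20):
    """Keep only URLs that answer 200 OK, checking in parallel.
    A simplified sketch of the verifier described above."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = pool.map(check, urls)
    return [u for u, s in zip(urls, statuses) if s == 200]
```

Filtering at this stage is what keeps dead links from ever consuming a ScraperWorker's time.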

export_json.py

The tool for getting data out of the system.

Usage:

# Export everything from the most recent session
python export_json.py

# Export specific session
python export_json.py --session Session_123abc

# Export only NEW jobs that haven't been exported before
python export_json.py --new-only

# Export to a custom file
python export_json.py -o my_data.json

Features:

  • Standardizes fields (maps job_location, primary_location -> location).
  • Marks records as exported in the DB to support incremental dumps.
  • Filters out "Unknown" or empty values.
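The standardization step amounts to a small normalization pass like the one below; the alias table is illustrative, based on the field mapping described above.

```python
# Sketch of the export-time field standardization: map vendor-specific
# location keys onto "location" and drop empty / "Unknown" values.
# The alias tuple is illustrative, not the project's full mapping.
LOCATION_ALIASES = ("job_location", "primary_location", "location")

def normalize_record(raw):
    record = {}
    for key, value in raw.items():
        if value in (None, "", "Unknown"):
            continue  # filter out useless values
        if key in LOCATION_ALIASES:
            key = "location"
        record.setdefault(key, value)  # first non-empty alias wins
    return record
```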

Data Structure

The exported JSON follows this structure:

[
  {
    "title": "Software Engineer",
    "url": "https://careers.google.com/jobs/...",
    "description": "Full job description text...",
    "location": "Mountain View, CA",
    "work_type": "Hybrid",
    "posted_date": "2025-01-15",
    "source_site": "https://careers.google.com",
    "custom_field_1": "value",
    "custom_field_2": "value"
  }
]

Troubleshooting

  1. "Connection pool is full":
    • This is a warning from urllib3, emitted when the number of threads exceeds the connection pool size. The system handles it gracefully, but you can reduce --workers if it spams the logs.
  2. Rate Limiting (406 Errors):
    • The system automatically detects this. You will see "Domain marked HOT" in the logs. The scraper will back off from that specific domain for 5-30 minutes. Do not restart the program; let it run, and it will retry later.
  3. No Jobs Found:
    • Ensure the discovery workers are running.
    • Check logs/ for detailed error messages.
    • Verify your IP hasn't been blocked by search engines (for dorking).

About

Avature Job Scraper
