A high-performance, multi-threaded scraping engine designed specifically to discover and harvest job data from Avature-hosted career portals. This system features active site discovery, robust rate-limiting handling, and a persistent queue-based architecture.
- Active Discovery: Automatically finds new Avature career sites using Google/Bing dorking, subdomains, and Certificate Transparency logs (crt.sh).
- Smart Scraping: Uses multiple strategies to extract job links (API interception, RSS feeds, Sitemaps, HTML fallback).
- Resilience: Intelligent handling of HTTP 429/406 rate limits with "Hot Domain" tracking and exponential backoff.
- Scalable Architecture: Worker-thread model managed by a central SQLite database queue.
- Validation: Pre-verifies seed URLs to ensure only valid targets enter the pipeline.
- Real-time Dashboard: Rich terminal UI (TUI) showing live progress, speed, and active threads.
The system operates on a Manager-Worker model orchestrated by main.py and backed by a SQLite database (avature_scraper.db).
- The Database: Acts as the central brain. It stores:
  - Sites: Known Avature domains.
  - Job Queue: State machine for URLs (`PENDING` -> `PROCESSING` -> `COMPLETED`).
  - Jobs: Extracted data.
- Manager: The `main.py` process initializes the session, launches threads, and runs the TUI.
- Workers:
  - `DiscoveryWorker`: Scans known sites for job listings.
  - `DorkingDiscoveryWorker`: Searches the web for new Avature sites.
  - `ScraperWorker`: Processes specific job pages to extract details.
  - `SeedLoader`: Ingests and validates external URL lists.
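The Manager/Worker split above can be sketched with Python's threading primitives. This is a toy model, not the project's code: an in-process queue stands in for the SQLite-backed job queue, and the worker function mirrors only the shape of a `ScraperWorker`.

```python
import queue
import threading

# Shared work queue stands in for the SQLite job queue.
tasks = queue.Queue()
results = []
lock = threading.Lock()

def scraper_worker():
    """Toy worker: pull URLs until the manager sends a shutdown sentinel."""
    while True:
        url = tasks.get()
        if url is None:  # sentinel: manager signals graceful shutdown
            break
        with lock:
            results.append(f"scraped:{url}")
        tasks.task_done()

# "Manager": launch the pool, enqueue work, then shut down gracefully.
workers = [threading.Thread(target=scraper_worker) for _ in range(3)]
for w in workers:
    w.start()
for u in ("a", "b", "c", "d"):
    tasks.put(u)
tasks.join()                # wait until all queued work is done
for _ in workers:
    tasks.put(None)         # one sentinel per worker
for w in workers:
    w.join()
```

The real system replaces the in-memory queue with database rows so that progress survives restarts.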
To start the system in default mode, run:

```bash
python main.py
```

It will look for the database, create a new session, and start discovering sites automatically.

If you have a large list of potential URLs (e.g., `Urls.txt`), the system can verify and ingest them:

```bash
python main.py --seed Urls.txt --workers 10
```

- Note: The system runs a `SeedLoader` thread that pre-verifies links (checking for 200 OK) before adding them to the scrape queue, to save resources.
Adjust the number of threads based on your machine's capability and network bandwidth.
```bash
python main.py --workers 20 --discovery-workers 5
```

- `--session <ID>`: Resume a specific session.
- `--reprocess`: Reset all `COMPLETED` and `FAILED` jobs to `PENDING` to re-scrape them.
- `--reset-stale`: Unlock jobs that were stuck in `PROCESSING` status due to a previous crash.
- `--rediscover`: Force the system to re-scan all known sites for new links.
The entry point. It sets up the GlobalStats object (for the TUI), initializes the Database, and spawns the thread pool. It handles the graceful shutdown and orchestrates the different worker types.
The core intelligence of the project.
`AvatureScraper` Class:
- Discovery: It attempts to reverse-engineer the Avature "Instant Search" API to get JSON results directly. If that fails, it falls back to parsing `sitemap.xml`, RSS feeds, and finally HTML scraping.
- Extraction: Parses job pages to extract `title`, `description`, `location`, `posted_date`, and dynamic fields. It uses `BeautifulSoup` and supports extracting metadata from URL slugs and JSON-LD tags.
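The JSON-LD path can be sketched as follows. The real extractor uses `BeautifulSoup`; this dependency-free sketch uses plain string handling, and the sample page and helper name are illustrative.

```python
import json

# Illustrative page containing a schema.org JobPosting in JSON-LD.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "JobPosting",
 "title": "Software Engineer",
 "datePosted": "2025-01-15",
 "jobLocation": {"address": {"addressLocality": "Mountain View, CA"}}}
</script>
</head></html>
"""

def extract_jsonld(page: str) -> dict:
    """Pull the first application/ld+json block out of a page and parse it."""
    marker = 'type="application/ld+json">'
    start = page.index(marker) + len(marker)
    end = page.index("</script>", start)
    return json.loads(page[start:end])

job = extract_jsonld(html)
title = job["title"]
location = job["jobLocation"]["address"]["addressLocality"]
```

Structured JSON-LD is preferred when present because it avoids the brittleness of scraping rendered HTML.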
`HotDomainTracker`: A singleton that tracks domains returning HTTP 406/429. It blocks all workers from accessing a "hot" domain for a cooldown period (starting at 5 minutes), helping prevent IP bans.
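A minimal sketch of this idea, assuming a doubling backoff on repeated rate limits (method names and exact timings here are illustrative, not the project's actual code):

```python
import threading
import time

class HotDomainTracker:
    """Marks domains 'hot' after 406/429 responses and blocks access
    until an exponentially growing cooldown expires."""

    BASE_COOLDOWN = 300  # assumed: 5 minutes, doubling per repeat offence

    def __init__(self):
        self._lock = threading.Lock()
        self._until = {}    # domain -> timestamp when cooldown ends
        self._strikes = {}  # domain -> consecutive rate-limit count

    def mark_hot(self, domain, now=None):
        now = time.time() if now is None else now
        with self._lock:
            strikes = self._strikes.get(domain, 0) + 1
            self._strikes[domain] = strikes
            self._until[domain] = now + self.BASE_COOLDOWN * 2 ** (strikes - 1)

    def is_hot(self, domain, now=None):
        now = time.time() if now is None else now
        with self._lock:
            return self._until.get(domain, 0) > now

tracker = HotDomainTracker()
tracker.mark_hot("busy.avature.net", now=0)         # first strike: 5-minute cooldown
hot_now = tracker.is_hot("busy.avature.net", now=60)
cooled = tracker.is_hot("busy.avature.net", now=400)
```

Because workers consult one shared tracker before each request, a single 429 is enough to pause the whole pool for that domain.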
Responsible for finding new Avature portals.
- Methods:
  - CRT.sh: Queries public SSL certificate logs for `%.avature.net`.
  - Dorking: Runs search queries (e.g., `site:avature.net/careers`) on Bing, DuckDuckGo, and optionally Google.
  - Candidates: Checks a built-in list of Fortune 500 companies.
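The crt.sh step can be sketched like this. crt.sh does expose a JSON endpoint; the sample response below is fabricated for illustration, and the live fetch is deliberately left out so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Build the crt.sh query URL for wildcard Avature certificates.
query_url = "https://crt.sh/?" + urlencode({"q": "%.avature.net", "output": "json"})

# Abridged, illustrative example of what crt.sh returns:
sample_response = json.dumps([
    {"name_value": "careers.acme.avature.net"},
    {"name_value": "jobs.example.avature.net\ncareers.acme.avature.net"},
])

def extract_domains(body: str) -> set:
    """crt.sh packs multiple SANs into one name_value, newline-separated,
    so each entry has to be split and deduplicated."""
    domains = set()
    for entry in json.loads(body):
        domains.update(entry["name_value"].splitlines())
    return domains

domains = extract_domains(sample_response)
```

Certificate Transparency logs are a reliable discovery source because every `*.avature.net` subdomain with a public TLS certificate appears in them.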
A robust SQLite wrapper using WAL (Write-Ahead Logging) mode for high concurrency.
- Queue Logic: Uses atomic transactions (`BEGIN IMMEDIATE`) to claim batches of jobs for workers, ensuring no two workers process the same link.
- Storage: Uses an EAV (Entity-Attribute-Value) schema (the `job_attributes` table) to store flexible job metadata without needing schema migrations for every new field.
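The batch-claim pattern can be sketched as below. Table and column names are illustrative, not the actual schema of `avature_scraper.db` (which also enables WAL via `PRAGMA journal_mode=WAL`; an in-memory database here keeps the sketch self-contained).

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute(
    "CREATE TABLE job_queue (id INTEGER PRIMARY KEY, url TEXT,"
    " status TEXT DEFAULT 'PENDING')"
)
conn.executemany(
    "INSERT INTO job_queue (url) VALUES (?)",
    [(f"https://example.avature.net/job/{i}",) for i in range(5)],
)

def claim_batch(conn, limit=2):
    """Grab up to `limit` PENDING jobs. BEGIN IMMEDIATE takes the write
    lock up front, so two workers can never claim the same rows."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        rows = conn.execute(
            "SELECT id FROM job_queue WHERE status = 'PENDING'"
            " ORDER BY id LIMIT ?", (limit,)
        ).fetchall()
        ids = [r[0] for r in rows]
        conn.executemany(
            "UPDATE job_queue SET status = 'PROCESSING' WHERE id = ?",
            [(i,) for i in ids],
        )
        conn.execute("COMMIT")
        return ids
    except Exception:
        conn.execute("ROLLBACK")
        raise

batch_a = claim_batch(conn)  # as if worker 1 called it
batch_b = claim_batch(conn)  # as if worker 2 called it
```

Selecting and updating inside one immediate transaction is what makes the claim atomic; a plain deferred transaction would leave a window for double-claiming.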
A high-speed, multi-threaded link checker. It uses HEAD/GET requests to validate URLs before they enter the database. It filters out 404s and soft-404s early in the pipeline.
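The validation pass can be sketched as a thread pool running HEAD probes. The probe is injectable so the sketch stays testable offline; function names are illustrative, not the checker's real API.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def head_status(url, timeout=10):
    """Live probe: issue a HEAD request and return the HTTP status code."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None  # DNS failure, timeout, 4xx/5xx raised as HTTPError, etc.

def filter_valid(urls, probe=head_status, workers=10):
    """Keep only URLs whose probe returns 200, checking concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = pool.map(probe, urls)
    return [u for u, s in zip(urls, statuses) if s == 200]

# Offline demo with a stub probe instead of live HEAD requests:
fake = {"https://ok.example": 200, "https://gone.example": 404}.get
valid = filter_valid(["https://ok.example", "https://gone.example"], probe=fake)
```

A real checker would also need soft-404 heuristics (e.g., a 200 page whose body says "job not found"), which a status code alone cannot catch.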
The tool for getting data out of the system.
Usage:
```bash
# Export everything from the most recent session
python export_json.py

# Export a specific session
python export_json.py --session Session_123abc

# Export only NEW jobs that haven't been exported before
python export_json.py --new-only

# Export to a custom file
python export_json.py -o my_data.json
```

Features:
- Standardizes fields (maps `job_location`, `primary_location` -> `location`).
- Marks records as exported in the DB to support incremental dumps.
- Filters out "Unknown" or empty values.
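The normalization pass can be sketched as a single dictionary transform. The alias tuple and helper name are illustrative; the real logic lives in `export_json.py`.

```python
# Assumed alias map: fields folded into the canonical "location" key.
LOCATION_ALIASES = ("job_location", "primary_location")
DROP_VALUES = {"", "Unknown", None}

def standardize(record: dict) -> dict:
    """Map location aliases onto one field and drop empty/'Unknown' values."""
    out = {}
    for key, value in record.items():
        if value in DROP_VALUES:
            continue
        if key in LOCATION_ALIASES:
            key = "location"
        out.setdefault(key, value)  # first non-empty alias wins
    return out

raw = {
    "title": "Software Engineer",
    "job_location": "Mountain View, CA",
    "work_type": "Unknown",
    "department": "",
}
clean = standardize(raw)
```

Dropping junk values at export time keeps the EAV storage permissive while the JSON output stays clean.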
The exported JSON follows this structure:
```json
[
  {
    "title": "Software Engineer",
    "url": "https://careers.google.com/jobs/...",
    "description": "Full job description text...",
    "location": "Mountain View, CA",
    "work_type": "Hybrid",
    "posted_date": "2025-01-15",
    "source_site": "https://careers.google.com",
    "custom_field_1": "value",
    "custom_field_2": "value"
  }
]
```

- "Connection pool is full":
  - This is a warning from `urllib3` that appears when the thread count exceeds the connection pool size. The system handles it gracefully, but you can reduce `--workers` if it spams the logs.
- Rate Limiting (406 Errors):
- The system automatically detects this. You will see "Domain marked HOT" in the logs. The scraper will back off from that specific domain for 5-30 minutes. Do not restart the program; let it run, and it will retry later.
- No Jobs Found:
  - Ensure the `discovery` workers are running.
  - Check `logs/` for detailed error messages.
  - Verify your IP hasn't been blocked by search engines (for dorking).