GitHub Actions pipelines that scrape job boards (LinkedIn, Indeed, Glassdoor, ZipRecruiter, Google Jobs, HiringCafe, USAJOBS, NEOGOV, CalOpps, CalCareers, CSU Careers) on a schedule, commit the results to the repo, and surface them in a single filterable triage.html dashboard hosted free on GitHub Pages, with a map, salary harmonization, cross-source de-duplication, notes, bulk workflow states, CSV export, application-packet prompts, and optional phone notifications. No server, no paid services, and no API keys required.
Everything you search for lives in one file: config.json. Point it at your field and locations (or generate it from your CV with an LLM) and you have your own tracker. Live example: scottcoff.in/Job_Scraper/triage.html.
This repo ships configured for environmental / toxicology roles (Dr. Scott Coffin's field; scottcoff.in) as a worked example, and began as Ernesto Diaz's Bay Area ML-engineer scraper. The walkthrough below sets up your own copy from scratch.
You only need a free GitHub account. Everything runs on GitHub's servers (Actions + Pages), so you don't have to install anything or keep a computer on. (Local install is optional; see Running locally.)
CLI shortcut (Steps 3-5 in one command): If you have GitHub CLI installed, clone your fork locally and run:
bash scripts/setup.sh
It enables Actions, sets the required variable, enables Pages, and walks you through optional credentials interactively. New to GitHub? Follow the full steps below; no CLI is needed.
Click Fork at the top of this page.
Your personal config (config.json, scoring_profile.json) and all scraped data (output/) are gitignored in this upstream repo, so syncing will never overwrite your customizations.
Enable the sync_upstream.yml workflow in your repo (Actions → Sync from upstream → Enable workflow) and it rebases new code improvements every Monday.
Use the workflow, not the GitHub "Sync fork" button. Because your fork has commits upstream doesn't (your config.json, your scraped data), GitHub's built-in button shows "Discard N commits", which would delete your config. The sync_upstream.yml workflow handles this correctly by rebasing your commits on top of upstream. The button is safe only before you've committed any personalization.
Always pull from https://github.com/ScottCoffin/Job_Scraper, never from someone else's personal fork.
Clone an existing copy to your computer
git clone https://github.com/YOUR-USERNAME/YOUR-REPO.git
cd YOUR-REPO
You don't need to clone just to configure it; you can edit config.json directly on github.com.
This is the only file you need to change. Pick one:
First: create config.json in your repo. On GitHub: click Add file → Create new file, name it config.json, then copy the contents of config.example.json in, edit, and commit. Do not delete or rename config.example.json: keeping it lets the sync workflow pull upstream improvements to the example without conflict.
A. Generate it from your CV (no coding). Open docs/cv-to-config-prompt.md, copy the prompt, and paste it into ChatGPT, Claude, or any chatbot together with your CV and your target locations. It returns a finished config.json ready to commit.
B. Edit by hand. config.example.json is fully self-documenting. The two things almost everyone changes: keywords.include + search_terms (what roles) and locations (where; LinkedIn geoId can be left "").
Optional knobs: profile (dashboard title/subtitle), keywords.exclude, employers.priority / employers.exclude, priority_topics (★ highlights), role_categories (the Role-filter buckets), and per-source search_terms / locations.
This publishes triage.html at a free public URL.
- In your repo: Settings → Pages.
- Under Build and deployment → Source, choose Deploy from a branch.
- Branch: main, folder: / (root) → Save.
- After ~1 minute your dashboard is live at:
https://YOUR-USERNAME.github.io/YOUR-REPO/triage.html
New to Pages? GitHub's 2-minute guide: Creating a GitHub Pages site.
Want it on a custom domain (like you.com/jobs)? See Managing a custom domain.
- Open the Actions tab → click "I understand my workflows, enable them."
- Settings → Actions → General → Workflow permissions → select Read and write permissions → Save.
- Settings → Secrets and variables → Actions → Variables tab → add a new variable:

| Variable | Value |
|---|---|
| ENABLE_DATA_COMMITS | true |

This tells CI to commit scraped results back to your repo. Without it, scrapers run but nothing is saved. The upstream template repo deliberately leaves this unset so its CI never commits data that would conflict with your copy when you sync.
In the Actions tab, open each watcher and click Run workflow. Afterwards they run automatically on their schedule β this first manual run seeds your dataset.
One-time historical backfill (recommended for new setups):
Several watchers have a backfill toggle in the "Run workflow" dialog that pulls a longer historical window to give you a full initial picture:
| Watcher | Default window | Backfill window |
|---|---|---|
| LinkedIn Watcher | last 1 hour | last 30 days |
| Indeed Watcher | last 24 hours | last 50 days |
| Glassdoor Watcher | last 24 hours | last 30 days; scheduled runs are opt-in |
| ZipRecruiter Watcher | last 24 hours | last 30 days |
| Google Jobs Watcher | last 24 hours | last 30 days |
| HiringCafe Watcher | last 30 days | last 61 days |
| Priority Employer Digest | last 24 hours | last 30 days |
| Local & State Gov Watcher (NEOGOV) | last 21 days | last 60 days |
To use: Actions → [Watcher name] → Run workflow → check "One-time backfill" → Run workflow.
No backfill needed for CalCareers, USAJOBS, and CalOpps: these sources return all current open listings on every run, so a single normal run is already a full snapshot.
Give it 1-2 minutes per watcher, then open your …/triage.html URL. Hard-refresh (Ctrl+Shift+R, or Cmd+Shift+R on macOS) after each scrape to see new jobs.
Confirm everything is configured: Actions → Validate Setup → Run workflow. It prints a checklist of required and optional items and fails loudly if anything critical is missing.
Get a push the moment a relevant new role appears, via Pushover (a simple, one-time ~$5 app for iOS / Android; the API is free):
- Sign in at pushover.net, Create an Application/API Token (any name) → copy the API Token. Copy your User Key from the dashboard home.
- In your repo: Settings → Secrets and variables → Actions → New repository secret → add PUSHOVER_TOKEN and PUSHOVER_USER.
- Test it: Actions → Test Pushover Notification → Run workflow; you should get a push within ~20 seconds.
- (Optional) Add a repository Variable NOTIFY_MIN_FIT to tune instant high-fit pings.
- (Optional) Add repository Variable WEEKLY_DIGEST_PUSHOVER=true to receive the weekly Pushover brief. You can also set WEEKLY_DIGEST_DAYS (default 7).
Without these secrets, notifications are simply off and everything else works.
triage_agent.py can score each role against your résumé with the Claude API (paid, ~pennies/run). It needs an ANTHROPIC_API_KEY secret plus your profile/résumé in secrets. Entirely optional: leave the triage.yml / evals.yml workflows disabled if you don't use it (Actions → workflow → ⋯ → Disable).
Each source is a workflow in .github/workflows/. To stop one, Actions → that workflow → ⋯ → Disable workflow. The dashboard simply skips any source file that doesn't exist, so nothing breaks.
Glassdoor is currently treated as an opt-in scheduled source because it is prone to upstream blocking and location-parse failures from shared GitHub Actions IPs. Manual Run workflow still works for testing. To schedule it, add repository Variable ENABLE_GLASSDOOR_WATCHER=true.
Optional: only if you want to test scrapes on your own machine. Needs Python 3.11+:
python scrape_jobs.py --linkedin-only # standard library only
python scrape_jobs.py --usajobs-only # standard library only
python scrape_jobs.py --hiringcafe-only # standard library only
pip install -r requirements.txt # JobSpy-backed boards
python scrape_jobs.py --indeed-only
python scrape_jobs.py --glassdoor-only
python scrape_jobs.py --ziprecruiter-only
python scrape_jobs.py --google-jobs-only
python -m http.server 8000 # then open http://localhost:8000/triage.html
The dashboard must be served over HTTP (the commands above); opening triage.html from file:// won't load the data.
The rest of this README documents how it works, using the shipped environmental / toxicology example. Skim it to customize further; you don't need any of it to get running.
The descriptions below use this repo's shipped example config (environmental / toxicology; California, Oregon & Australia). Your locations, keywords, and employers come from config.json; see the walkthrough above.
Hits LinkedIn's public guest endpoint for roles in your configured locations posted in the last 24 hours, then post-filters to a priority-employer allowlist (employers.priority in config.json). Treat the shipped employer list as an example only: replace it with the companies, agencies, universities, nonprofits, labs, hospitals, startups, studios, or other organizations that matter in your own field. Add to that list to expand coverage.
Output goes to jobs.json, jobs.md, and jobs.html. Each run dedupes against the previously-committed jobs.json, so the output surfaces only postings new since the last run.
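The dedupe-against-the-previous-run step can be sketched roughly as follows. This is a minimal illustration, not the code in scrape_jobs.py: it assumes the previous output is a JSON list of job dicts and that each job carries a `url` field, which is a hypothetical key chosen for the example. The helper names `load_previous_keys` and `only_new` are also hypothetical.

```python
import json
from pathlib import Path


def load_previous_keys(path):
    """Return the set of job keys from the last committed output (empty if none)."""
    p = Path(path)
    if not p.exists():
        return set()
    try:
        previous = json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return set()
    # "url" is an assumed key; the real schema may differ.
    return {job["url"] for job in previous if isinstance(job, dict) and job.get("url")}


def only_new(scraped, previous_keys):
    """Keep jobs not seen in the previous run, also dropping repeats within this run."""
    seen = set(previous_keys)
    fresh = []
    for job in scraped:
        key = job.get("url")
        if not key or key in seen:
            continue
        seen.add(key)
        fresh.append(job)
    return fresh
```

Under these assumptions, a run would call `only_new(scraped, load_previous_keys("output/jobs.json"))` before writing the new files.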
A direct-ATS probe path (CURATED_BIOTECHS) also exists but is empty by default in the shipped example. It is useful only when your target employers expose job data through supported public ATS endpoints. The LinkedIn + JobSpy-backed keyword watchers (Indeed, Glassdoor, ZipRecruiter, and Google Jobs) are the primary sources for most users.
Hits LinkedIn's public guest endpoint for roles in your configured locations posted in the last hour across your search_terms, dedupes by job ID, and sorts by recency. Output goes to linkedin_jobs.json, linkedin_jobs.md, and linkedin_jobs.html.
Runs hourly at :17 PT (8am-8pm) via native GitHub cron, with the in-repo watchdog (linkedin_watch_backup.yml at :33) re-dispatching missed slots. A block guard preserves the previous results when LinkedIn returns zero cards across every term (rate-limited run).
⚠️ Uses the unauthenticated public guest endpoint only. It never signs in with a user account and does not use LinkedIn cookies, tokens, or credentials.
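The block guard and job-ID dedupe described above might look something like this. It is a sketch under stated assumptions, not the repo's implementation: `apply_block_guard` is a hypothetical name, and each card is assumed to be a dict with a `job_id` key.

```python
def apply_block_guard(cards_per_term, previous_jobs):
    """Return (jobs_to_write, blocked).

    cards_per_term maps each search term to the list of cards it returned.
    If every term came back empty, the run is treated as rate-limited and
    the previous results are kept instead of being overwritten with nothing.
    """
    total = sum(len(cards) for cards in cards_per_term.values())
    if total == 0:
        return list(previous_jobs), True

    merged = {}
    for cards in cards_per_term.values():
        for card in cards:
            # "job_id" is an assumed key; the first copy of each ID wins.
            merged.setdefault(card["job_id"], card)
    return list(merged.values()), False
```

The guard only triggers when all terms are empty; a single empty term alongside populated ones is treated as a normal result.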
Uses python-jobspy (Indeed's RSS and Publisher API were deprecated in 2026 and the site sits behind Cloudflare; JobSpy uses Indeed's mobile-app API internally). Searches your configured locations. Output goes to indeed_jobs.json, indeed_jobs.md, and indeed_jobs.html, deduped against the previous run. Runs at :47 PT, offset from LinkedIn's :17 slot.
Also uses python-jobspy to add Glassdoor coverage without adding a second scraping stack. By default it reuses the same search terms and locations as Indeed unless you add search_terms.glassdoor or locations.glassdoor to config.json. Output goes to glassdoor_jobs.json, glassdoor_jobs.md, and glassdoor_jobs.html, deduped against the previous run. Glassdoor often returns 403 or "location not parsed" from shared CI IPs, so its scheduled job is opt-in with repository Variable ENABLE_GLASSDOOR_WATCHER=true; manual workflow dispatch remains available for testing.
For blocked JobSpy-backed sources, add a GitHub Actions secret named JOBSPY_PROXIES with a comma-separated proxy list accepted by JobSpy, and optionally set repository Variable JOBSPY_USER_AGENT. The scraper also reads the same values from jobspy.proxies and jobspy.user_agent in config.json, but secrets are safer for proxy credentials.
Uses python-jobspy for ZipRecruiter coverage. By default it searches the US locations from locations.indeed; set search_terms.ziprecruiter or locations.ziprecruiter to tune it separately. Output goes to ziprecruiter_jobs.json, ziprecruiter_jobs.md, and ziprecruiter_jobs.html, deduped against the previous run. Runs at :27 PT.
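The Glassdoor and ZipRecruiter location fallbacks described above can be sketched as a small lookup. This is illustrative only: `resolve_locations` is a hypothetical helper, and it assumes each location entry is a dict with `location` and `country` keys and that US entries use the country value `USA`.

```python
def resolve_locations(config, source):
    """Pick the location list for a JobSpy-backed source, applying the documented fallbacks."""
    locations = config.get("locations", {})
    own = locations.get(source)
    if own:
        return own  # an explicit per-source list always wins

    indeed = locations.get("indeed", [])
    if source == "glassdoor":
        # Glassdoor reuses the Indeed locations when it has none of its own.
        return indeed
    if source == "ziprecruiter":
        # ZipRecruiter defaults to the US subset of the Indeed locations.
        return [loc for loc in indeed if loc.get("country", "").upper() == "USA"]
    return []
```

The same pattern would apply to search_terms; it is omitted here to keep the sketch short.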
Uses JobSpy's Google Jobs adapter first, which keeps this repo free of paid proxy APIs and browser automation when Google still serves parseable job payloads. Google is different from the other JobSpy boards: the scraper builds full google_search_term strings such as toxicologist jobs near California since yesterday, because Google Jobs ignores JobSpy's generic search_term, location, and hours_old parameters. Configure with search_terms.google_jobs and locations.google_jobs, or set exact strings in google_jobs.queries.
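As a rough illustration of the query building described above, the sketch below composes full search strings in the documented shape. `build_google_queries` is a hypothetical name, and the sketch assumes search_terms.google_jobs and locations.google_jobs are plain lists of strings, which may not match the real config schema.

```python
def build_google_queries(config):
    """Return full Google Jobs query strings, one per term/location pair."""
    exact = config.get("google_jobs", {}).get("queries")
    if exact:
        # Exact strings from google_jobs.queries bypass auto-building.
        return list(exact)

    terms = config.get("search_terms", {}).get("google_jobs", [])
    places = config.get("locations", {}).get("google_jobs", [])
    return [
        f"{term} jobs near {place} since yesterday"
        for term in terms
        for place in places
    ]
```

Each resulting string would then be passed as the google_search_term for one query.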
If JobSpy returns zero raw rows across every query, the watcher can fall back to structured Google Jobs APIs. Add either SERPAPI_API_KEY or both OXYLABS_USERNAME and OXYLABS_PASSWORD as GitHub Actions secrets. The Oxylabs fallback follows their Google Jobs API pattern: q, ibp=htl;jobs, hl, and gl in the Google URL plus rendered parsing instructions. Output goes to google_jobs.json, google_jobs.md, and google_jobs.html, deduped against the previous run. Runs at :37 PT.
Searches HiringCafe's public SSR /jobs/<query> pages for direct-from-employer listings, because the old unauthenticated API endpoint now returns 401/405. Configure with search_terms.hiring_cafe and cap pagination with hiring_cafe.max_pages in config.json. The public SEO route currently defaults to United States results. If HiringCafe returns no rows or errors for every page, the scraper preserves the previous hiringcafe_jobs.* files instead of wiping the dashboard source.
A title is included if it matches the include terms generated from config.json. Multi-word phrases match as substrings; single tokens are word-bounded, so list full words. The shipped example uses environmental/toxicology terms like these; replace them with terms for your own domain:
Domain/core role examples: toxicologist, software engineer, product manager, grant writer, clinical research coordinator
Methods or specialty examples: risk assess, machine learning, regulatory affairs, clinical trials, financial modeling, curriculum design
Tools, products, or regulated-area examples: R Shiny, Salesforce, Good Clinical Practice, NEPA, SAP, Kubernetes, Adobe Creative Suite
Topic examples: microplastic, PFAS, cybersecurity, housing policy, oncology, renewable energy, early childhood education
Seniority or work-style examples: senior, principal, director, remote, hybrid, field, research, policy
The list is deliberately tight for precision: generic titles (research scientist, senior scientist, data scientist, professor, regulatory affairs) are usually too broad on their own. Pair broad words with your domain, method, tool, or organization context, for example environmental data scientist, healthcare data scientist, or assistant professor of environmental health.
Excluded everywhere:
- Junior / training: intern, internship, co-op, trainee, apprentice, technician, research/lab/teaching assistant, undergraduate, postdoc, work-study, volunteer, fellowship. Keep or remove these based on the user's target career stage.
- Adjacent-but-wrong families: add terms that are common false positives in your domain. In the shipped example, EHS/workplace-safety terms are excluded because they are adjacent to, but different from, the target environmental toxicology roles. In another domain this might be sales, customer support, bench research, finance, management-only roles, or another nearby category.
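The matching rule above (multi-word phrases as substrings, single tokens on word boundaries) can be sketched with the standard re module. The function names are hypothetical, and applying the same rule to exclude terms is an assumption made for the example rather than something the README states.

```python
import re


def compile_terms(terms):
    """Multi-word phrases match as substrings; single tokens match on word boundaries."""
    patterns = []
    for term in terms:
        cleaned = term.strip().lower()
        escaped = re.escape(cleaned)
        if " " in cleaned:
            patterns.append(re.compile(escaped))
        else:
            patterns.append(re.compile(rf"\b{escaped}\b"))
    return patterns


def title_passes(title, include, exclude):
    """Keep a title that hits at least one include term and no exclude term."""
    lowered = title.lower()
    if any(p.search(lowered) for p in exclude):
        return False
    return any(p.search(lowered) for p in include)
```

This shows why the README says to list full words: with word-bounded tokens, `toxicologist` does not match "Toxicologists", while the phrase `risk assess` still matches "Risk Assessment".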
You define the locations in config.json → locations (no code edits). The shipped example searches California, Portland & Bend OR, and Australia, but it works anywhere; add or remove entries to suit:
- LinkedIn: locations.linkedin entries each take a location + LinkedIn geoId. Leave geoId blank to let LinkedIn resolve the text (works for most cities/metros), or fill in the numeric id for tighter filtering. A geoId reference table is in docs/cv-to-config-prompt.md.
- Indeed / Glassdoor / ZipRecruiter: each takes a location + country (USA, Australia, GB, Canada, …). Glassdoor falls back to Indeed locations if omitted; ZipRecruiter defaults to the US Indeed locations.
- Google Jobs: locations.google_jobs is used to build the full google_search_term text JobSpy requires; advanced users can bypass auto-building with google_jobs.queries.
- HiringCafe: currently uses HiringCafe's public US SEO search route; page depth is capped by hiring_cafe.max_pages.
- USAJOBS is nationwide US (federal); NEOGOV is filtered to your configured locations; CalCareers and CalOpps are California-only boards by nature (disable them if you're not searching California).
- The map and dashboard auto-fit to wherever your jobs are.
| File | Source | Description |
|---|---|---|
| jobs.json / .md / .html | Priority-employer digest | Allowlisted employer roles for your configured domain, last 24h, deduped against the previous run |
| linkedin_jobs.json / .md / .html | LinkedIn watcher | Roles in your configured locations, last 1h, deduped |
| indeed_jobs.json / .md / .html | Indeed watcher | Indeed-sourced roles in your locations, last 24h, deduped |
| glassdoor_jobs.json / .md / .html | Glassdoor watcher | Glassdoor-sourced roles in your locations, last 24h, deduped |
| ziprecruiter_jobs.json / .md / .html | ZipRecruiter watcher | ZipRecruiter-sourced roles in your locations, last 24h, deduped |
| google_jobs.json / .md / .html | Google Jobs watcher | Google Jobs roles in your locations, last 24h, deduped |
| hiringcafe_jobs.json / .md / .html | HiringCafe watcher | Direct-employer roles from HiringCafe, last 30d, guarded |
| calcareers_jobs.json / .md / .html | CalCareers watcher | California state civil-service roles (calcareers.ca.gov) |
| csucareers_jobs.json / .md / .html | CSU Careers watcher | California State University systemwide roles (csucareers.calstate.edu) |
| usajobs_jobs.json / .md / .html | USAJOBS watcher | US federal roles matching your configured keywords, with salary, via usajobs.gov |
| governmentjobs_jobs.json / .md / .html | NEOGOV watcher | State & local-gov roles matching your configured keywords via governmentjobs.com |
| calopps_jobs.json / .md / .html | CalOpps watcher | California local-agency roles (cities, counties, special districts) via calopps.org |
| all_jobs.json | accumulator | Cumulative 14-day master (feeds the dashboard + triage) |
| scores.json | triage agent | Optional fit verdicts keyed by job URL |
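A cumulative 14-day master like all_jobs.json could be maintained along these lines. This is only a sketch: the function name `accumulate` and the `url` and `first_seen` fields are hypothetical, and whether the real accumulator measures the window from first-seen date or from posting date is an assumption.

```python
from datetime import date, timedelta


def accumulate(master, new_jobs, today, window_days=14):
    """Merge this run's jobs into the rolling master and drop entries older than the window."""
    by_url = {job["url"]: job for job in master}
    for job in new_jobs:
        if job["url"] not in by_url:
            # Stamp first_seen only once, the first time a URL appears.
            by_url[job["url"]] = {**job, "first_seen": today.isoformat()}

    cutoff = today - timedelta(days=window_days)
    return [
        job for job in by_url.values()
        if date.fromisoformat(job["first_seen"]) >= cutoff
    ]
```

Keeping the original first_seen stamp is what lets a downstream consumer ask for "roles first seen in the last N days".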
scrape_jobs.py --calcareers-only scrapes calcareers.ca.gov, the CA state civil-service portal. This is useful when your configured role terms overlap with California state classifications. CalCareers is an ASP.NET WebForms site with no public API, so the scraper seeds a session and fires the search postback (__EVENTTARGET=ctl00$cphMainContent$btnSearch with the keyword field), then parses the labeled result cards. The working postback method was adapted from the OpenPostings calcareers module. Fully guarded; runs daily via calcareers_watch.yml. Verified against the shipped example configuration.
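For readers unfamiliar with WebForms postbacks, the general technique is to echo back the page's hidden fields and set __EVENTTARGET to the button being "clicked". The sketch below shows that pattern with the standard library only. It is not the repo's scraper: `HiddenFieldParser` and `build_search_postback` are hypothetical names, and the keyword field name `ctl00$cphMainContent$txtKeyword` is a placeholder assumption, since the README does not give it.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect hidden <input> fields (such as __VIEWSTATE) from a WebForms page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_search_postback(page_html, keyword,
                          keyword_field="ctl00$cphMainContent$txtKeyword"):
    """Return a urlencoded POST body that replays the search-button postback."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)  # echo the server's hidden state back
    form["__EVENTTARGET"] = "ctl00$cphMainContent$btnSearch"
    form["__EVENTARGUMENT"] = ""
    form[keyword_field] = keyword  # placeholder field name (assumption)
    return urlencode(form)
```

The body would be POSTed to the same page within the seeded session so the server's cookies and hidden state line up.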
scrape_jobs.py --csucareers-only scrapes csucareers.calstate.edu, the California State University systemwide PageUp listing. It walks the paginated listing table, keeps roles whose title or summary matches your configured keywords, and preserves the previous CSU output if the remote listing scan is incomplete. Runs daily via csucareers_watch.yml.
scrape_jobs.py --usajobs-only scrapes usajobs.gov for US federal roles matching your configured keywords, with salary. It uses the site's public search endpoint (/Search/ExecuteSearch), so no API key is required: it seeds a session, then POSTs each keyword and keeps titles that pass your configured filter. Runs daily via usajobs_watch.yml. Federal roles are nationwide; use the dashboard's location filter/map to focus.
Source identified from the OpenPostings project's catalog of 80+ ATS providers. OpenPostings is a self-hosted aggregator (not a hosted API), so rather than depend on it we query the official USAJOBS public endpoint directly.
Also added from the OpenPostings catalog are the boards that carry county/city roles LinkedIn and Indeed may miss:
- --governmentjobs-only (governmentjobs.com / NEOGOV): state & local agencies nationwide; keyword-searched and filtered to your configured locations.
- --calopps-only (calopps.org): California local agencies (cities, counties, special & water districts). CA-only board, so it is title-filtered only.
Both are HTML scrapes (no API), fully guarded, and run daily via localgov_watch.yml. Local-government roles can be sparse, so yield is often low but high-signal when your target domain appears on public-sector boards.
The triage.html cockpit adds, on top of the source/role/seniority/date filters:
- ★ Priority topics: roles touching your configured signature topics get a gold ★ and a highlighted card; a toggle filters to just those. The shipped example uses microplastics, ecotoxicology, endocrine-disrupting chemicals, and R/Shiny. Edit priority_topics in config.json and the matching dashboard terms to change what's flagged.
- Cross-source de-dup: the same role cross-posted to LinkedIn, Indeed, Glassdoor, ZipRecruiter, Google Jobs, HiringCafe, and public-sector boards collapses into one card using normalized job IDs first, then conservative title + company + location/content checks. Triage applies to all copies at once.
- Best fit view: ranks roles by match to the target user's specializations. The shipped example uses environmental/toxicology criteria, but you should replace those weights with criteria for your own domain. Weights live in FIT_TERMS in triage.html; every card shows a 0-100 fit chip.
- 🚫 Not relevant button: hides a role and learns from it: titles sharing distinctive words with your "not relevant" marks are down-ranked in Best fit.
- Bulk triage + notes: select multiple visible roles, then mark them saved, applied, interview, offer, dismissed, or not relevant in one action. Each card also has a local note field for follow-up details, contacts, or deadlines.
- Company blocking: hide a noisy employer from this browser without editing config; the block list is stored locally with your triage state.
- Application packets: copy a per-job or bulk prompt containing job details, fit signals, notes, and instructions for tailoring resume bullets, a cover letter, and screening answers. It never auto-applies or submits credentials.
- Companion toolkit: links out to focused resume/application prep tools that make more sense as standalone helpers than as static-dashboard internals.
- CSV export: download the currently visible jobs, including status, notes, source badges, salary, remote/type metadata, and URLs.
- Keyboard triage: / focuses search; j/k moves the focused card; x selects it; s, a, i, and d mark saved/applied/interview/dismissed; p copies an application packet; o opens the focused job; n focuses notes; 1/2/3 switch Browse/Best fit/Map; ? shows the shortcut hint.
- Salary slider: harmonizes inconsistent pay formats (hourly, monthly, yearly, $k ranges, title-embedded) to an annual figure, then filters by a minimum, with an "include unlisted" toggle.
- 🗺 Map view: Leaflet map of roles by city (client-side geocoding, no API key) that auto-fits to wherever your jobs are; hover a dot for the location, click for the roles. Remote/unknown roles cluster at a default center.
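To make the salary harmonization idea concrete, here is a small Python illustration of converting mixed pay strings to an annual figure. The dashboard itself does this in the browser inside triage.html, so this is not its code. The 2080 hours-per-year factor, the midpoint rule for ranges, and the name `annualize` are all assumptions made for the example.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours x 52 weeks
_NUMBER = re.compile(r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)


def annualize(text):
    """Return an approximate annual figure for a pay string, or None if no number is found.

    Ranges collapse to the midpoint of their first two numbers (an assumption).
    """
    amounts = []
    for number, k in _NUMBER.findall(text):
        value = float(number.replace(",", ""))
        if k:
            value *= 1000
        amounts.append(value)
    if not amounts:
        return None

    pair = amounts[:2]
    base = sum(pair) / len(pair)
    lowered = text.lower()
    if re.search(r"hour|/\s*hr\b", lowered):
        return round(base * HOURS_PER_YEAR)
    if re.search(r"month|/\s*mo\b", lowered):
        return round(base * 12)
    return round(base)
```

Returning None for unparseable strings maps naturally onto the "include unlisted" toggle: those roles have no figure to compare against the minimum.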
A single-file dashboard hosted on GitHub Pages that merges the latest source JSONs into one filterable cockpit: search; source / role / seniority filters (role buckets come from config.json); save / applied / interview / offer / dismiss / not-relevant triage persisted in localStorage; per-job notes; company blocking; CSV export; application packets; bulk actions; keyboard shortcuts; and top-companies, role-mix, and salary charts.
View it (after enabling Pages; see Deployment): https://scottcoffin.github.io/Job_Scraper/triage.html
The dashboard fetches the JSON files from the same repo at view time, so it always reflects the latest committed scrape. To run locally:
python -m http.server 8000
# then visit http://localhost:8000/triage.html
Opening from file:// won't work: the dashboard needs same-origin HTTP to fetch() the source JSONs.
From the Actions tab → Run workflow on any watcher, or locally:
python scrape_jobs.py --biotech-only # priority-employer digest (allowlist)
python scrape_jobs.py --linkedin-only # general LinkedIn, last 1h
python scrape_jobs.py --indeed-only # general Indeed, last 24h
python scrape_jobs.py --glassdoor-only # general Glassdoor, last 24h
python scrape_jobs.py --ziprecruiter-only # general ZipRecruiter, last 24h
python scrape_jobs.py --google-jobs-only # Google Jobs, last 24h
python scrape_jobs.py --hiringcafe-only # HiringCafe, last 30d
python scrape_jobs.py --usajobs-only # US federal jobs (usajobs.gov, no API key)
python scrape_jobs.py --governmentjobs-only # state/local gov (NEOGOV)
python scrape_jobs.py --calopps-only # California local agencies (calopps.org)
python scrape_jobs.py --calcareers-only # California state jobs (calcareers.ca.gov)
python scrape_jobs.py --csucareers-only # California State University jobs (csucareers.calstate.edu)
The LinkedIn / priority / HiringCafe / USAJOBS / gov pipelines use only the Python standard library. Indeed, Glassdoor, ZipRecruiter, and Google Jobs use one optional dependency: pip install -r requirements.txt (single package, python-jobspy).
Quick setup is in the walkthrough Step 6; this is the detail.
Get a push to your phone the moment a highly-relevant new role appears. After each scrape, notify.py pushes any new posting that either touches a priority topic from your configuration or scores ≥ NOTIFY_MIN_FIT (default 75) on the resume-fit model. It dedupes against notified.json, so the same role is never pushed twice (across sources or runs). Priority-topic hits ping at high priority. The shipped example's priority topics are environmental/toxicology-specific placeholders; replace them with the topics that signal an unusually good match in your domain.
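The decision rule above can be summarized in a few lines. This is a sketch of the logic as described, not the contents of notify.py: `should_notify` is a hypothetical function, the dedup key is assumed to be the job `url`, and the `"high"` / `"normal"` labels are placeholders for whatever priority values the real script sends.

```python
import re


def should_notify(job, priority_patterns, fit_score, notified, min_fit=75):
    """Return (send, priority) for one posting.

    notified is the set of keys already pushed, playing the role of notified.json.
    """
    key = job["url"]  # assumed dedup key
    if key in notified:
        return False, None  # never push the same role twice

    text = f"{job.get('title', '')} {job.get('description', '')}"
    if any(re.search(p, text, re.IGNORECASE) for p in priority_patterns):
        return True, "high"  # priority-topic hits ping at high priority
    if fit_score >= min_fit:
        return True, "normal"
    return False, None
```

After a successful push, the caller would add the key to `notified` and write it back so later runs skip it.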
To enable, add these in Settings → Secrets and variables → Actions:
| Secret | Value |
|---|---|
| PUSHOVER_TOKEN | Your Pushover application/API token (create an app at pushover.net) |
| PUSHOVER_USER | Your Pushover user key (top of your pushover.net dashboard) |

Optional Variable (not secret): NOTIFY_MIN_FIT. Set it lower than 75 for more (less selective) pings, higher for fewer. Without the two secrets, notifications are simply off (everything else still works).
Weekly brief: the Weekly Job Digest workflow runs Monday morning and is off by default. To opt in, add repository Variable WEEKLY_DIGEST_PUSHOVER=true. The brief reads all_jobs.json for roles first seen in the last 7 days, groups them by salary band and organization, and includes a few standouts ranked by scores.json when the optional triage agent has run. If scores.json is absent, it falls back to the same deterministic resume-fit scorer used for instant Pushover alerts, so no LLM is required. Optional variables:
| Variable | Value |
|---|---|
| WEEKLY_DIGEST_PUSHOVER | true to enable the scheduled weekly brief |
| WEEKLY_DIGEST_DAYS | Lookback window; default 7 |
| DASHBOARD_URL | Override the link attached to the push |
The weekly digest does not need an LLM at send time. When scores.json is empty, it uses deterministic criteria in notify.py (FIT_TERMS, SIGNATURE_TERMS, and POOR_FIT_TERMS) to pick the closest matches. Calibrate those criteria from a real gold-standard duty statement before trusting the fallback. A gold-standard role is the kind of posting that should be treated as a perfect match for the target user and score 100.
Beginner workflow:
- Collect examples. You do not need to read or paste code.
  - 1-3 perfect-fit job descriptions or duty statements that should score 100.
  - 5-10 good-fit jobs that should score roughly 70-89.
  - 10-20 false positives that should score below 25.
  - Your CV/resume, or a short profile of your target roles.
- Paste the "simple calibration prompt" below into your preferred LLM.
- In GitHub, create or edit a file named scoring_profile.json at the repo root.
- Paste the LLM's JSON output into that file and commit it.
- Run python notify.py --weekly-digest --dry-run or manually dispatch the weekly digest workflow and check whether the listed matches look right.
Simple calibration prompt:
You are helping calibrate job-fit scoring for a job scraper. I am a non-technical
user. Output only valid JSON that I can paste directly into a file named
scoring_profile.json. Do not include markdown fences, comments, prose, or
trailing commas.
Goal:
- A job matching the GOLD-STANDARD DUTY STATEMENT should score 100/100.
- A strong adjacent role should score 70-89.
- A plausible but generic adjacent role should score 35-59.
- A poor-fit role should score below 25 even if it contains broad words like
<<<PASTE 5-10 BROAD DOMAIN WORDS THAT CREATE FALSE POSITIVES HERE>>>.
Candidate profile/CV:
<<<PASTE CV OR RESUME TEXT HERE>>>
Gold-standard 100/100 duty statement:
<<<PASTE DUTY STATEMENT TEXT HERE>>>
Gold-standard summary, if useful:
<<<PASTE A SHORT DESCRIPTION OF WHY THIS ROLE SHOULD SCORE 100, E.G. "This role
combines [domain], [methods/tools], [seniority], [organization type], and
[work products] that exactly match the target user.">>>
Optional negative examples:
<<<PASTE JOB TITLES/DESCRIPTIONS THAT SHOULD NOT BE STANDOUTS HERE>>>
Task:
1. Extract the exact positive scoring dimensions from the gold-standard duty
statement. Separate must-have signals from nice-to-have signals.
2. Identify broad terms that create false positives and should not score highly
by themselves.
3. Identify job families, industries, seniority levels, or task types that should
be penalized.
4. Return JSON using exactly this shape:
{
  "version": 1,
  "description": "Short non-private description of this scoring profile.",
  "settings": {
    "title_multiplier": 3,
    "body_multiplier": 1,
    "score_multiplier": 1.6,
    "generic_cap": 35,
    "standout_threshold": 60
  },
  "fit_terms": [
    {"pattern": "specific positive phrase|another positive phrase", "weight": 12}
  ],
  "signature_terms": [
    "regex for evidence that this is truly candidate-specific"
  ],
  "poor_fit_terms": [
    {"pattern": "false positive phrase|wrong job family", "penalty": 35}
  ],
  "test_cases": [
    {
      "title": "Gold-standard role title",
      "company": "Example organization",
      "description": "Short excerpt or summary",
      "expected_score_range": [100, 100],
      "rationale": "Why this should score 100"
    }
  ]
}
5. Include the gold-standard role as a test case with expected score range [100, 100].
6. Use JSON strings for regex patterns. Escape backslashes as needed for valid
JSON, for example "\\bword\\b".
Important calibration requirements:
- The gold-standard role must score exactly 100.
- Broad domain terms must not produce high scores by themselves. They should
require pairing with candidate-specific evidence such as target methods,
tools, subject matter, seniority, organization type, regulated domain,
deliverables, or work products.
- Generic adjacent jobs, wrong-seniority jobs, wrong-industry jobs, and roles
with misleading keyword overlap should be penalized unless the description
contains strong candidate-specific evidence.
- If there are no roles above 60, the digest should label them as closest
matches rather than standouts.
Output format:
- Output only the JSON object for scoring_profile.json.
- Do not include private CV details in public-facing fields such as description
or rationale.
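Before committing the LLM's output, it can help to sanity-check it. The sketch below is a hypothetical helper, not part of the repo: it checks that the text parses as JSON, that the top-level keys from the shape in the prompt are present, and that every pattern compiles as a regular expression. It does not verify that the scores come out as intended.

```python
import json
import re

REQUIRED_KEYS = ("version", "settings", "fit_terms", "signature_terms", "poor_fit_terms")


def validate_scoring_profile(raw_text):
    """Return a list of problems found in the JSON text; an empty list means it looks usable."""
    try:
        profile = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(profile, dict):
        return ["top level must be a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in profile]

    patterns = []
    for section in ("fit_terms", "poor_fit_terms"):
        for entry in profile.get(section) or []:
            if isinstance(entry, dict):
                patterns.append(entry.get("pattern", ""))
    patterns.extend(p for p in profile.get("signature_terms") or [] if isinstance(p, str))

    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"bad regex {pattern!r}: {exc}")
    return problems
```

For example, `validate_scoring_profile(open("scoring_profile.json", encoding="utf-8").read())` returning an empty list is a reasonable signal that the file is at least well-formed before you run the dry run.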
Test it (sends one push to your phone):
- From GitHub (recommended): Actions → Test Pushover Notification → Run workflow. Uses your Actions secrets, so it confirms the real setup. The run log prints whether the keys are set and the exact Pushover API response on failure (e.g. a bad token/user key).
- Weekly digest dry run: python notify.py --weekly-digest --dry-run
- Locally: PUSHOVER_TOKEN=xxx PUSHOVER_USER=yyy python notify.py --test

triage_agent.py scores each new role against your profile with the Claude API. It is optional and needs three repo secrets (Settings → Secrets and variables → Actions):
| Secret | Value |
|---|---|
| ANTHROPIC_API_KEY | Anthropic API key |
| CANDIDATE_PROFILE | Short profile text (your background/targets; kept out of the public repo) |
| CANDIDATE_RESUME | Resume / CV text (kept out of the public repo) |
Paste your CV text into CANDIDATE_RESUME. Without these secrets, leave triage.yml and evals.yml disabled (Actions → ⋯ → Disable workflow). The scrapers and dashboard work fully without them; scores.json is optional.
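If you run the triage agent locally, a quick pre-flight check for the three secrets can save a failed run. This is a minimal sketch that assumes the secrets are exposed as environment variables with the same names; `missing_secrets` is a hypothetical helper, not part of triage_agent.py.

```python
import os

# Names taken from the secrets table above.
REQUIRED_SECRETS = ("ANTHROPIC_API_KEY", "CANDIDATE_PROFILE", "CANDIDATE_RESUME")


def missing_secrets(env=None):
    """Return the required names that are unset or blank in the environment."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name, "").strip()]
```

An empty list from `missing_secrets()` means all three values are present. Anything else names exactly what still needs to be set.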
Note:
eval_triage.py still contains the original ML-candidate golden cases. They only matter if you run the triage agent; rewrite them for your domain (or keep evals.yml disabled) once you've finalized your profile.
├── config.example.json           # Template config - copy to config.json and edit
├── scoring_profile.example.json  # Template scoring profile - copy to scoring_profile.json
├── config.json                   # YOUR settings (gitignored; not committed upstream)
├── scoring_profile.json          # YOUR scoring weights (gitignored; not committed upstream)
├── triage.html                   # Interactive dashboard (served by GitHub Pages)
├── scrape_jobs.py                # All scraping logic (reads config.json)
├── notify.py                     # Pushover notifications (optional)
├── triage_agent.py               # Optional nightly fit-scoring agent (Claude API)
├── eval_triage.py                # Golden-case evals for the triage agent
├── requirements.txt              # python-jobspy (Indeed, Glassdoor, ZipRecruiter, Google Jobs)
├── output/                       # Scraped data - gitignored upstream, populated by your CI
│   ├── jobs.{json,md,html}       # Priority-employer digest (last 24h)
│   ├── linkedin_jobs.{json,md,html}
│   ├── indeed_jobs.{json,md,html}
│   ├── glassdoor_jobs.{json,md,html}
│   ├── ziprecruiter_jobs.{json,md,html}
│   ├── google_jobs.{json,md,html}
│   ├── hiringcafe_jobs.{json,md,html}
│   ├── calcareers_jobs.{json,md,html}
│   ├── usajobs_jobs.{json,md,html}
│   ├── governmentjobs_jobs.{json,md,html}
│   ├── calopps_jobs.{json,md,html}
│   ├── all_jobs.json             # Cumulative 14-day master (feeds dashboard + triage)
│   ├── scores.json               # Triage verdicts (optional)
│   ├── notified.json             # Push-notification dedup log
│   └── workflow_runs.jsonl       # CI run audit log
├── docs/
│   ├── cv-to-config-prompt.md    # LLM prompt to generate config.json from a CV
│   └── triage.gif                # Dashboard demo
└── .github/workflows/
    ├── scrape_jobs.yml           # Daily - priority-employer digest
    ├── linkedin_watch.yml        # Hourly :17 PT - general LinkedIn (last 1h)
    ├── indeed_watch.yml          # Hourly :47 PT - Indeed (last 24h)
    ├── glassdoor_watch.yml       # Hourly :07 PT - Glassdoor (last 24h)
    ├── ziprecruiter_watch.yml    # Hourly :27 PT - ZipRecruiter (last 24h)
    ├── google_jobs_watch.yml     # Hourly :37 PT - Google Jobs (last 24h)
    ├── hiringcafe_watch.yml      # Hourly :57 PT - HiringCafe (last 30d)
    ├── calcareers_watch.yml      # Daily - CalCareers (California state jobs)
    ├── usajobs_watch.yml         # Daily - USAJOBS (federal jobs, no API key)
    ├── localgov_watch.yml        # Daily - NEOGOV + CalOpps (state & local gov)
    ├── linkedin_watch_backup.yml # Watchdog :33 PT - re-dispatches missed runs
    ├── weekly_digest.yml         # Weekly - optional Pushover summary brief
    ├── triage.yml                # Nightly - optional fit scoring (needs secrets)
    ├── evals.yml                 # Triage-agent evals (optional)
    └── sync_upstream.yml         # Weekly - auto-merge code updates from upstream
Everything you'd adjust lives in config.json (no code edits); the scraper and dashboard both read it:
- keywords.include: title-match terms · keywords.exclude: titles to drop.
- search_terms.*: queries sent to LinkedIn, Indeed, Glassdoor, ZipRecruiter, Google Jobs, and HiringCafe.
- locations.*: LinkedIn geoId, JobSpy location + country, and Google query location text.
- google_jobs.queries: optional exact Google Jobs search-box strings; Google Jobs API fallback credentials should be stored as GitHub Actions secrets.
- jobspy: optional proxies/user-agent for blocked JobSpy-backed boards; prefer GitHub Actions secrets for proxy credentials.
- hiring_cafe: optional page-depth guardrail.
- employers.priority (allowlist for the digest) / employers.exclude (drop).
- priority_topics (highlights) · role_categories (Role-filter buckets) · profile (dashboard + digest branding).
Generate the whole file from your CV with docs/cv-to-config-prompt.md, or edit it by hand (every key is commented).
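To make the keywords.include / keywords.exclude pair concrete, here is a small sketch of how a title filter driven by those two keys could behave. Case-insensitive substring matching is an assumption, the sample terms are made up, and `title_passes` is a hypothetical helper; the actual matching logic in scrape_jobs.py may differ.

```python
# Hypothetical config fragment: only the key names come from config.json.
config = {
    "keywords": {
        "include": ["toxicologist", "environmental scientist"],
        "exclude": ["intern"],
    }
}


def title_passes(title, keywords):
    """Keep a title if it matches an include term and no exclude term."""
    lowered = title.lower()
    # Exclusions win, so a matching include term cannot rescue the title.
    if any(term.lower() in lowered for term in keywords.get("exclude", [])):
        return False
    return any(term.lower() in lowered for term in keywords.get("include", []))


titles = ["Senior Toxicologist", "Toxicologist Intern", "Staff Accountant"]
kept = [t for t in titles if title_passes(t, config["keywords"])]
```

Under substring matching, a short exclude term such as "intern" would also drop titles containing "internal", so longer or more specific terms are safer if the real filter works this way.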
