Data Sources Catalogue
Operational reference for every place we pull jobs from. For each source: what's in the tile, what needs a detail-fetch, auth requirements, anti-bot characteristics, latency, known quirks, and which scoring track applies.
Last updated: 2026-06-03.
Quick reference table
| Source | File | Render | Auth needed | Description in tile? | Cadence | Scoring track |
|---|---|---|---|---|---|---|
| Upwork | upwork.py | SSR via patchright (logged-in profile) | Yes - Freelancer account | ✅ full | 15 min | UPWORK |
| LinkedIn | linkedin.py | guest-mode search + detail-fetch | No for search; detail fetches behind authwall | ❌ tile-only (title + company + location + date) | 120 min | NON-UPWORK |
| Indeed | indeed.py | via FlareSolverr sidecar | No (FlareSolverr passes Cloudflare) | ✅ snippet, full description via detail fetch | 60 min | NON-UPWORK |
| RemoteRocketship | remoterocketship.py | headed Chromium under Xvfb (Cloudflare) | No | 🟡 thin (title + company + location + seniority + funding) | 120 min | NON-UPWORK |
| Consortia | consortia.py | JS-rendered, settle-and-parse | No | 🟡 snippet only | 120 min | NON-UPWORK |
| IntelligentPeople | intelligentpeople.py | JS-rendered, skips filled | No | 🟡 snippet, full salary + location | 120 min | NON-UPWORK |
| Welcome to the Jungle | (planned) | JS-rendered, requires login | Yes - WTTJ account | ❓ TBD | TBD | NON-UPWORK |
| Generic engine (CWJobs, CV-Library, Built In London, Hays, Robert Walters, Michael Page, Sphere, Cranberry Panda, Jobserve) | generic.py | per-site config: http or browser | Per-site | 🟡 varies by config | 240 min | NON-UPWORK |
| Detail-fetch fallback | detail_fetcher.py | httpx + BeautifulSoup | No | n/a (post-tile enrichment) | on-demand | n/a |
Legend: ✅ full description / 🟡 partial or snippet / ❌ tile only
Per-source nuances
Upwork
- URL pattern: `https://www.upwork.com/nx/find-work/<topic_id>` (NOT `/nx/search/jobs/?topic_id=<id>`, which is Cloudflare-protected and which we can't pass without JS).
- Why we use find-work, not search/jobs: the `/nx/find-work/` route returns 2 SSR'd job tiles per request and is NOT Cloudflare-protected. The `/nx/search/jobs/` route is behind an interactive challenge that kills our session.
- Profile required: Freelancer account (set during `browser_login.py upwork`: pick FREELANCER on the account-selection screen, never Client or Agency). The `upwork-client` profile exists separately for talent-search scraping (mining top freelancers).
- Description quality: full, included in tile.
- Anti-bot: mild on find-work; aggressive on search/jobs and talent-search. Cookies expire, so re-run `browser_login.py upwork` if the scraper logs `served the account-selection screen`.
- Quirks: Upwork auto-appends "I am willing to pay higher rates for the most experienced freelancers" when the client picks Expert level. IGNORE it as client voice in scoring.
- Per-search throttling: 15-min default cadence; some searches have low daily volume.
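One way to honour the Expert-level quirk above is to strip the auto-appended sentence before the description reaches scoring. The following is a minimal sketch, not the scorer's actual code; the function and constant names are hypothetical, and only the quoted sentence comes from this catalogue.

```python
import re

# Sentence Upwork appends when the client picks Expert level (quoted in this catalogue).
UPWORK_EXPERT_BOILERPLATE = (
    "I am willing to pay higher rates for the most experienced freelancers"
)

# Match the sentence with any leading whitespace and an optional trailing full stop.
_BOILERPLATE_RE = re.compile(
    r"\s*" + re.escape(UPWORK_EXPERT_BOILERPLATE) + r"\.?", re.IGNORECASE
)


def strip_upwork_boilerplate(description: str) -> str:
    """Hypothetical helper: remove the platform-generated sentence so it is not read as client voice."""
    return _BOILERPLATE_RE.sub("", description).strip()
```

Stripping rather than flagging keeps the scoring prompt unchanged; a flag on the job row would work equally well if the raw text must be preserved.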
LinkedIn
- URL pattern (search): `https://www.linkedin.com/jobs/search/?...` with `geoId` + `keywords` + `f_WT=2` (remote) parameters. NOT `/jobs/search-results/` (forces login).
- URL pattern (detail): `https://uk.linkedin.com/jobs/view/<slug>-<id>`. Auth-walled if accessed too aggressively.
- Tile contents: title, company, location, posted date. NO description.
- Detail fetch: required for any meaningful scoring. Lives in `linkedin.py:fetch_detail()`. Returns `{description, budget, criteria}`.
- Anti-bot: STRONG. Detail-page fetches sit behind a per-IP rate limit; aggressive fetching → `/authwall` redirect. Defence:
  - 30-60s jitter between detail fetches (in `JITTER_BETWEEN_DETAILS_S`).
  - Authwall latch: if `/authwall` is hit, the scraper sets `self._authwalled=True` and short-circuits all further detail fetches for the rest of the run.
  - Throwaway browser profile per LinkedIn request (no persisted cookies).
- Volume cadence: 120 min default. Anything more aggressive risks `/authwall` on the IP.
- Per-region searches: geoId 101165590 = United Kingdom, 105646813 = Spain, 105512687 = Valencia.
- Quirks:
  - "Easy Apply" tag = higher volume / lower quality.
  - Listing tile doesn't include description; if our detail-fetch is in throttle backoff, the job stays in the unscored queue until the next cycle.
  - No cap on detail-fetches per run as of 2026-06-03 (the old `--max-li-details 30` was removed). Rely on jitter + authwall-latch.
- Backfill: `scripts/backfill_descriptions.py --source linkedin` for historical empty descriptions.
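The jitter and authwall-latch defence can be sketched as below. This is an illustration under assumptions, not the code in `linkedin.py`: the `DetailFetchGuard` class, the injected `fetch` callable (returning the final URL and the HTML), and treating `JITTER_BETWEEN_DETAILS_S` as a `(min, max)` pair are all hypothetical.

```python
import random
import time

# Assumed shape: (min_seconds, max_seconds). The real constant may be defined differently.
JITTER_BETWEEN_DETAILS_S = (30, 60)


class DetailFetchGuard:
    """Hypothetical wrapper: jitter between detail fetches, latch on /authwall."""

    def __init__(self, fetch, sleep=time.sleep, jitter=JITTER_BETWEEN_DETAILS_S):
        self._fetch = fetch          # callable: url -> (final_url, html)
        self._sleep = sleep          # injectable so tests do not really wait
        self._jitter = jitter
        self._authwalled = False     # latch, mirrors self._authwalled in the scraper
        self._fetched_once = False

    def fetch_detail(self, url):
        if self._authwalled:
            return None              # short-circuit for the rest of the run
        if self._fetched_once:
            self._sleep(random.uniform(*self._jitter))
        self._fetched_once = True
        final_url, html = self._fetch(url)
        if "/authwall" in final_url:
            self._authwalled = True  # trip the latch; no further fetches this run
            return None
        return html
```

Jobs whose fetch returns `None` simply stay in the unscored queue until the next cycle, which matches the behaviour described under Quirks.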
Indeed
- URL pattern: `https://uk.indeed.com/jobs?q=title%3A%22<term>%22&l=&sc=0kf%3Aattr(DSQF7)%3B&from=searchOnDesktopSerp` for UK remote; `https://es.indeed.com/jobs?...` for Spain.
- Tile contents: title, company, location, salary if present, brief snippet.
- Detail fetch: `indeed.py:fetch_description()`. Returns full body text.
- Anti-bot: Cloudflare Turnstile. Our defence is a FlareSolverr sidecar container (`FLARESOLVERR_URL` env var, default `http://flaresolverr:8191/v1`). Each fetch takes 5-15s because FlareSolverr's own browser waits out the challenge.
- Volume cadence: 60 min default (longer than Upwork due to slow FlareSolverr cycle).
- Quirks:
  - Indeed silently changes filter URLs occasionally; if a source returns 0 over several runs, check the saved URL still resolves.
  - UK Indeed has many duplicate listings (agencies re-posting); dedup by `source_id` handles most.
  - Spain Indeed → only score English-language ads per criteria.
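For orientation, a FlareSolverr round trip looks roughly like the sketch below. It uses FlareSolverr's `request.get` command and reads the page HTML from `solution.response`; the helper names and timeout values are assumptions, and `indeed.py` may well use httpx instead of the standard library.

```python
import json
import os
import urllib.request

FLARESOLVERR_URL = os.environ.get("FLARESOLVERR_URL", "http://flaresolverr:8191/v1")


def build_payload(url: str, max_timeout_ms: int = 60000) -> dict:
    """Request body for FlareSolverr's request.get command."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def extract_html(reply: dict) -> str:
    """Pull the solved page HTML out of a FlareSolverr reply, or raise on failure."""
    if reply.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr error: {reply.get('message')}")
    return reply["solution"]["response"]


def fetch_via_flaresolverr(url: str, timeout_s: float = 90.0) -> str:
    """Hypothetical helper: POST the payload and return the page HTML."""
    request = urllib.request.Request(
        FLARESOLVERR_URL,
        data=json.dumps(build_payload(url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return extract_html(json.load(response))
```

The HTTP timeout is kept above `maxTimeout` so that FlareSolverr, not the client, decides when a challenge has taken too long.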
RemoteRocketship
- URL pattern: `https://www.remoterocketship.com/country/<slug>/jobs/<role-slug>/?...&isOnLinkedIn=false`.
- Why the `isOnLinkedIn=false` filter: it returns only jobs that are NOT on LinkedIn, so results don't duplicate what our LinkedIn searches already cover.
- Tile contents: title, company, location, seniority, employment type, funding stage. NO full description.
- Detail fetch: can use `detail_fetcher.fetch_description_http()` (plain HTTP; no auth wall, no Cloudflare on the detail page).
- Anti-bot: Cloudflare on the search page → use headed Chromium under Xvfb (`xvfb-run` wrapper in Docker entrypoint).
- Volume cadence: 120 min.
- Quirks: thin tile data → score conservatively until detail-fetch lands.
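Because the `isOnLinkedIn=false` filter is what keeps this source from overlapping with LinkedIn, a saved search URL can be normalised before use. The helper below is a hypothetical sketch using only the standard library; it is not part of `remoterocketship.py`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_not_on_linkedin(url: str) -> str:
    """Hypothetical helper: force isOnLinkedIn=false on a search URL, keeping other params."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "isOnLinkedIn"
    ]
    params.append(("isOnLinkedIn", "false"))
    return urlunsplit(parts._replace(query=urlencode(params)))
```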
Consortia
- URL pattern: `https://www.consortia.com/jobs/?rcf_id=<filter_ids>&jobtype=<perm|contract>&searchid=<id>&loc_option=<remote|...>`.
- Tile contents: title, location, salary range, brief snippet (~100-200 chars), job ID.
- Detail fetch: generic HTTP fetcher (`detail_fetcher.py`) on the detail URL.
- Anti-bot: none for browse; it is a public board.
- Render: JS-hydrated tiles. We wait 5s after `load`, then parse the `.jobContainer` DOM.
- Volume cadence: 120 min.
- Quirks: PM-specialist agency, so lane fit is usually high. Filled vacancies tend to disappear from search results (no special handling needed).
IntelligentPeople
- URL pattern: `https://www.intelligentpeople.co.uk/jobs/?_sft_job_discipline=product-management&_sft_job_county=<remote|remote-england>&...`. Hybrid via `_sf_s=hybrid` param.
- Tile contents: title, salary range, location, snippet, job type (perm / contract).
- Detail fetch: generic HTTP fetcher.
- Anti-bot: none for browse.
- Render: JS-hydrated, similar to Consortia.
- Special handling: scraper SKIPS filled vacancies automatically (detects `.expired--message` or `expired` class on `.job-card--body`).
- Volume cadence: 120 min.
- Quirks: PM-specialist. Filled vacancies are visible on the public page but greyed out, so we skip them.
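The filled-vacancy check can be illustrated with the standard-library HTML parser. This sketch assumes one reading of the rule above: a tile counts as filled if any element carries the `expired--message` class, or if an element has both `job-card--body` and `expired`. The real scraper may use BeautifulSoup or browser selectors instead, and the function name is hypothetical.

```python
from html.parser import HTMLParser


class _FilledDetector(HTMLParser):
    """Sets .filled when a tile carries one of the expired markers."""

    def __init__(self):
        super().__init__()
        self.filled = False

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if "expired--message" in classes or {"job-card--body", "expired"} <= classes:
            self.filled = True


def is_filled(tile_html: str) -> bool:
    """Hypothetical helper: True if the tile HTML looks like a filled vacancy."""
    detector = _FilledDetector()
    detector.feed(tile_html)
    return detector.filled
```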
Generic engine (one engine, per-site YAML config)
- Sites currently configured: CWJobs, CV-Library, Built In London, Hays Digital Technology, Robert Walters, Michael Page Technology, Sphere Digital Recruitment, Cranberry Panda, Jobserve.
- Render mode: per-site (`http` or `browser`). Sites with JS-rendered job lists use `browser` (patchright); SSR sites use `http`.
- Tile contents: varies wildly by site. Some have full descriptions; most have snippets only.
- Detail fetch: generic HTTP fetcher on the linked detail page.
- Anti-bot: varies. Cloudflare-blocked sites (Robert Walters, CV-Library) require `browser` mode. Others fetch fine with `http`.
- Per-site browser profile: generic configs can specify `profile: <name>` to use a dedicated browser profile (e.g. for `welcometothejungle` once login is seeded).
- Cadence: 240 min default; these are lower-priority than the dedicated agency boards.
- Silent-breakage canary: the engine logs WARNING when a site returns 0 jobs unexpectedly. Paired with `search_runs.jobs_found` tracking, this lets us spot DOM drift.
- Quirks: selectors are best-guess starting points; if a site returns 0 unexpectedly, dig into its DOM and refine the per-site selector config.
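A zero-jobs canary of this kind might look like the sketch below. It assumes "unexpectedly" means the site returned jobs in at least one recent run; the function name, logger name, and the idea of passing recent `jobs_found` counts as a list are all hypothetical.

```python
import logging
from typing import Sequence

logger = logging.getLogger("generic_engine")  # hypothetical logger name


def check_zero_canary(site: str, jobs_found: int, recent_counts: Sequence[int]) -> bool:
    """Warn when a site that recently returned jobs now returns none.

    recent_counts: earlier jobs_found values for this site (e.g. read from search_runs).
    Returns True if the canary fired.
    """
    if jobs_found == 0 and any(count > 0 for count in recent_counts):
        logger.warning(
            "%s returned 0 jobs (recent runs: %s); possible DOM drift",
            site,
            list(recent_counts),
        )
        return True
    return False
```

A site with no history, or one that has always returned 0, does not fire under this rule, which avoids noise from freshly added configs.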
Welcome to the Jungle (PLANNED)
- Status: scraper not built. Awaiting (a) login profile seeded via `browser_login.py welcometothejungle`, (b) a search URL from the user (the URL provided so far is a single-job link).
- Render: JS-rendered SPA, login-required for jobs view.
- Plan: use the generic engine with `render: browser` and `profile: welcometothejungle`.
- Profile context: Jonny's WTTJ profile is at 84% completion; AI/Claude/vibe-coding angle missing from current bullets. Updating the profile content is a separate task.
Sufficiency gate & detail-fetch routing
Every job in the DB must pass _is_description_sufficient() in scripts/score_jobs.py BEFORE scoring. The check:
1. Length ≥ 500 chars.
2. Doesn't end with a truncation marker (read more, see more, show more, view more, load more, ..., …).
3. Ends with terminal punctuation (.!?")]}).
4. Contains a structural signal (bullet, OR words like requirements / responsibilities / about / experience / qualifications / we offer / you'll / etc).
If any check fails → fetch via the source-appropriate fetcher:
- linkedin → LinkedInScraper.fetch_detail(url)
- indeed → IndeedScraper.fetch_description(url) (via FlareSolverr)
- everything else → detail_fetcher.fetch_description_http(url) (plain httpx + body extraction)
After fetch, re-check sufficiency. If still insufficient → increment score_attempts. After 5 failed attempts → mark status='unscorable' and surface in dashboard for paste-and-rescore.
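The four checks and the routing rule can be sketched as follows. This is an approximation written from the description above, not the code in `scripts/score_jobs.py`: the structural word list is limited to the words named here (the real list is longer), the bullet pattern is an assumption, and `pick_fetcher` is a hypothetical helper that returns the fetcher's name rather than calling it.

```python
import re

MIN_LENGTH = 500
MAX_SCORE_ATTEMPTS = 5  # after this many failures the job becomes 'unscorable'
TRUNCATION_MARKERS = (
    "read more", "see more", "show more", "view more", "load more", "...", "\u2026",
)
TERMINAL_PUNCTUATION = '.!?")]}'
# Only the words named in this catalogue; the production list is longer.
STRUCTURAL_WORDS = (
    "requirements", "responsibilities", "about", "experience",
    "qualifications", "we offer", "you'll",
)
BULLET_RE = re.compile(r"^\s*[-*\u2022]\s+", re.MULTILINE)  # assumed bullet styles


def is_description_sufficient(text) -> bool:
    body = (text or "").strip()
    if len(body) < MIN_LENGTH:                      # check 1: length
        return False
    lowered = body.lower()
    if lowered.endswith(TRUNCATION_MARKERS):        # check 2: before check 3, since "..." ends in "."
        return False
    if body[-1] not in TERMINAL_PUNCTUATION:        # check 3: terminal punctuation
        return False
    if BULLET_RE.search(body):                      # check 4: structural signal
        return True
    return any(word in lowered for word in STRUCTURAL_WORDS)


def pick_fetcher(source: str) -> str:
    """Name of the fetcher used when the description is insufficient."""
    routes = {
        "linkedin": "LinkedInScraper.fetch_detail",
        "indeed": "IndeedScraper.fetch_description",
    }
    return routes.get(source, "detail_fetcher.fetch_description_http")
```

The truncation check has to run before the punctuation check, otherwise a description ending in `...` would pass as properly terminated.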
How to add a new source
- Easy path (use the generic engine): add a YAML block under `generic:` in `config/searches.yaml` with the per-site config (URL, render mode, container selector, title/link/etc selectors, cadence, url_prefix, country_default). Test by running the scraper for that source once. Refine the container selector if the canary fires "0 jobs".
- Custom path (when generic doesn't fit): copy `src/scrapers/consortia.py` as a template, build a per-source class with `fetch_jobs()` + `fetch_many()`, register in `scripts/run_scrape.py:SCRAPER_CLASSES`, add the YAML section to `searches.yaml`.
- Always update this catalogue with the new source's nuances.
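For the custom path, the shape of a per-source class might resemble the skeleton below. Everything here is illustrative: `exampleboard` is not a real source, the injected `load_tiles` callable stands in for whatever fetching and parsing `consortia.py` does, and the method signatures and the dict form of `SCRAPER_CLASSES` are assumptions rather than the project's actual interface.

```python
from typing import Callable, Dict, Iterable, List

Job = Dict[str, str]


class ExampleBoardScraper:
    """Hypothetical per-source scraper exposing fetch_jobs() and fetch_many()."""

    source = "exampleboard"

    def __init__(self, load_tiles: Callable[[str], List[Job]]):
        # load_tiles: url -> list of parsed tile dicts (stand-in for browser/HTTP + parsing)
        self._load_tiles = load_tiles

    def fetch_jobs(self, url: str) -> List[Job]:
        """Return valid tiles for one search URL, tagged with the source name."""
        jobs = []
        for tile in self._load_tiles(url):
            if tile.get("source_id") and tile.get("title"):
                jobs.append({**tile, "source": self.source})
        return jobs

    def fetch_many(self, urls: Iterable[str]) -> List[Job]:
        """Run several searches and de-duplicate by source_id, keeping the first hit."""
        seen: Dict[str, Job] = {}
        for url in urls:
            for job in self.fetch_jobs(url):
                seen.setdefault(job["source_id"], job)
        return list(seen.values())


# Assumed registry shape; the real SCRAPER_CLASSES lives in scripts/run_scrape.py.
SCRAPER_CLASSES = {ExampleBoardScraper.source: ExampleBoardScraper}
```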
Test methodology when a source breaks
- Check the `search_runs` table for the latest `jobs_found` count for the source; silent zeroes mean DOM drift.
- Run the scraper manually: `docker exec jobhunt sh -c "cd /app && xvfb-run -a /app/.venv/bin/python scripts/run_scrape.py --force"` (or `--source <name>` if implemented).
- Inspect saved debug HTML files in `strategy/upwork/` or `/tmp/` for selector mismatches.
- For LinkedIn: check for `/authwall` redirect language in the response HTML.
- For Cloudflare-protected sites: check FlareSolverr health (`curl http://localhost:8191/v1` should return JSON).