Data Sources Catalogue

Operational reference for every place we pull jobs from. For each source: what's in the tile, what needs a detail-fetch, auth requirements, anti-bot characteristics, latency, known quirks, and which scoring track applies.

Last updated: 2026-06-03.


Quick reference table

| Source | File | Render | Auth needed | Description in tile? | Cadence | Scoring track |
|---|---|---|---|---|---|---|
| Upwork | upwork.py | SSR via patchright (logged-in profile) | Yes - Freelancer account | ✅ full | 15 min | UPWORK |
| LinkedIn | linkedin.py | guest-mode search + detail-fetch | No for search; detail fetches behind authwall | ❌ tile-only (title + company + location + date) | 120 min | NON-UPWORK |
| Indeed | indeed.py | via FlareSolverr sidecar | No (FlareSolverr passes Cloudflare) | ✅ snippet, full description via detail fetch | 60 min | NON-UPWORK |
| RemoteRocketship | remoterocketship.py | headed Chromium under Xvfb (Cloudflare) | No | 🟡 thin (title + company + location + seniority + funding) | 120 min | NON-UPWORK |
| Consortia | consortia.py | JS-rendered, settle-and-parse | No | 🟡 snippet only | 120 min | NON-UPWORK |
| IntelligentPeople | intelligentpeople.py | JS-rendered, skips filled | No | 🟡 snippet, full salary + location | 120 min | NON-UPWORK |
| Welcome to the Jungle | (planned) | JS-rendered, requires login | Yes - WTTJ account | ❓ TBD | TBD | NON-UPWORK |
| Generic engine (CWJobs, CV-Library, Built In London, Hays, Robert Walters, Michael Page, Sphere, Cranberry Panda, Jobserve) | generic.py | per-site config: http or browser | Per-site | 🟡 varies by config | 240 min | NON-UPWORK |
| Detail-fetch fallback | detail_fetcher.py | httpx + BeautifulSoup | No | n/a (post-tile enrichment) | on-demand | n/a |

Legend: ✅ full description / 🟡 partial or snippet / ❌ tile only


Per-source nuances

Upwork

  • URL pattern: https://www.upwork.com/nx/find-work/<topic_id> (NOT /nx/search/jobs/?topic_id=<id> — that's Cloudflare-protected and we can't pass it without JS).
  • Why we use find-work, not search/jobs: the /nx/find-work/ route returns 2 SSR'd job tiles per request and is NOT Cloudflare-protected. The /nx/search/jobs/ route is behind an interactive challenge that kills our session.
  • Profile required: Freelancer account (set during browser_login.py upwork - pick FREELANCER on the account-selection screen, never Client or Agency). The upwork-client profile exists separately for talent-search scraping (mining top freelancers).
  • Description quality: full, included in tile.
  • Anti-bot: mild on find-work; aggressive on search/jobs and talent-search. Cookies expire; re-run browser_login.py upwork if the scraper logs show it was served the account-selection screen.
  • Quirks: Upwork auto-appends "I am willing to pay higher rates for the most experienced freelancers" when the client picks Expert level. Ignore this sentence when scoring; it is not the client's voice.
  • Per-search throttling: 15-min default cadence; some searches have low daily volume.
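
The Expert-level quirk above could be handled by stripping the auto-appended sentence before a description reaches scoring. This is a minimal sketch, not the scraper's actual code: the helper name `strip_upwork_boilerplate` and the constant name are hypothetical, and only the quoted sentence comes from this catalogue.

```python
import re

# Sentence Upwork appends automatically when the client selects Expert level.
UPWORK_EXPERT_BOILERPLATE = (
    "I am willing to pay higher rates for the most experienced freelancers"
)

# Match the sentence with optional surrounding whitespace and trailing period.
_BOILERPLATE_RE = re.compile(
    r"\s*" + re.escape(UPWORK_EXPERT_BOILERPLATE) + r"\.?\s*",
    flags=re.IGNORECASE,
)


def strip_upwork_boilerplate(description: str) -> str:
    """Remove the auto-appended sentence (hypothetical helper)."""
    return _BOILERPLATE_RE.sub(" ", description).strip()
```

Stripping at ingestion, rather than teaching the scorer to ignore the sentence, keeps the rule in one place; whether that fits the real pipeline is an open design choice.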

LinkedIn

  • URL pattern (search): https://www.linkedin.com/jobs/search/?... with geoId + keywords + f_WT=2 (remote) parameters. NOT /jobs/search-results/ (forces login).
  • URL pattern (detail): https://uk.linkedin.com/jobs/view/<slug>-<id>. Auth-walled if accessed too aggressively.
  • Tile contents: title, company, location, posted date. NO description.
  • Detail fetch: required for any meaningful scoring. Lives in linkedin.py:fetch_detail(). Returns {description, budget, criteria}.
  • Anti-bot: STRONG. Detail-page fetches sit behind a per-IP rate limit; aggressive fetching triggers an /authwall redirect. Defence:
      • 30-60s jitter between detail fetches (in JITTER_BETWEEN_DETAILS_S).
      • Authwall latch: if /authwall is hit, the scraper sets self._authwalled=True and short-circuits all further detail fetches for the rest of the run.
      • Throwaway browser profile per LinkedIn request (no persisted cookies).
  • Volume cadence: 120 min default. Anything more aggressive risks /authwall on the IP.
  • Per-region searches: geoId 101165590 = United Kingdom, 105646813 = Spain, 105512687 = Valencia.
  • Quirks:
      • "Easy Apply" tag = higher volume / lower quality.
      • Listing tile doesn't include description; if our detail-fetch is in throttle backoff, the job stays in the unscored queue until the next cycle.
      • No cap on detail-fetches per run as of 2026-06-03 (the old --max-li-details 30 was removed). Rely on jitter + authwall-latch.
  • Backfill: scripts/backfill_descriptions.py --source linkedin for historical empty descriptions.
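
The jitter and authwall-latch defences described above can be sketched as follows. This is an illustration under assumptions, not the code in linkedin.py: the class `DetailFetchGate`, the injected `fetch` callable (assumed to return the final URL after redirects plus the HTML), and the injectable `sleep` are all hypothetical. Only `JITTER_BETWEEN_DETAILS_S`, the 30-60s range, and the `_authwalled` flag come from this catalogue.

```python
import random
import time
from typing import Callable, Optional, Tuple

JITTER_BETWEEN_DETAILS_S = (30, 60)  # seconds, range named in the catalogue


class DetailFetchGate:
    """Hypothetical wrapper showing jitter plus the authwall latch."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[str, str]],
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch  # assumed to return (final_url, html)
        self._sleep = sleep
        self._authwalled = False
        self._first = True

    def fetch_detail(self, url: str) -> Optional[str]:
        if self._authwalled:
            return None  # latch set: skip every fetch for the rest of the run
        if not self._first:
            # Random pause between consecutive detail fetches.
            self._sleep(random.uniform(*JITTER_BETWEEN_DETAILS_S))
        self._first = False
        final_url, html = self._fetch(url)
        if "/authwall" in final_url:
            self._authwalled = True  # trip the latch and give up on this job
            return None
        return html
```

Injecting `sleep` keeps the sketch testable without waiting out real delays.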

Indeed

  • URL pattern: https://uk.indeed.com/jobs?q=title%3A%22<term>%22&l=&sc=0kf%3Aattr(DSQF7)%3B&from=searchOnDesktopSerp for UK remote; https://es.indeed.com/jobs?... for Spain.
  • Tile contents: title, company, location, salary if present, brief snippet.
  • Detail fetch: indeed.py:fetch_description(). Returns full body text.
  • Anti-bot: Cloudflare Turnstile. Our defence is a FlareSolverr sidecar container (FLARESOLVERR_URL env var, default http://flaresolverr:8191/v1). Each fetch takes 5-15s because FlareSolverr's own browser waits out the challenge.
  • Volume cadence: 60 min default (longer than Upwork due to slow FlareSolverr cycle).
  • Quirks:
      • Indeed silently changes filter URLs occasionally. If a source returns 0 over several runs, check the saved URL still resolves.
      • UK Indeed has many duplicate listings (agencies re-posting); dedup by source_id handles most.
      • Spain Indeed: score only English-language ads, per the criteria.
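
For orientation, a fetch through the FlareSolverr sidecar might look like the sketch below. It uses FlareSolverr's `request.get` command and reads the solved page from `solution.response`; the function names, the 60000 ms `maxTimeout`, and the 90s client timeout are assumptions for illustration, not values taken from indeed.py.

```python
import json
import os
import urllib.request

FLARESOLVERR_URL = os.environ.get("FLARESOLVERR_URL", "http://flaresolverr:8191/v1")


def build_payload(url: str, max_timeout_ms: int = 60000) -> dict:
    """Request body for FlareSolverr's request.get command."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def extract_html(reply: dict) -> str:
    """Pull the solved page HTML out of a FlareSolverr reply, or raise."""
    if reply.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr error: {reply.get('message')}")
    return reply["solution"]["response"]


def fetch_via_flaresolverr(url: str, timeout_s: float = 90.0) -> str:
    """POST the request to the sidecar and return the page HTML."""
    body = json.dumps(build_payload(url)).encode("utf-8")
    req = urllib.request.Request(
        FLARESOLVERR_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The client timeout should exceed maxTimeout so the sidecar can finish.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return extract_html(json.load(resp))
```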

RemoteRocketship

  • URL pattern: https://www.remoterocketship.com/country/<slug>/jobs/<role-slug>/?...&isOnLinkedIn=false.
  • Why the isOnLinkedIn=false filter: specifically catches jobs NOT on LinkedIn (so we de-duplicate against our LinkedIn searches).
  • Tile contents: title, company, location, seniority, employment type, funding stage. NO full description.
  • Detail fetch: can use detail_fetcher.fetch_description_http() (plain HTTP - no auth wall, no Cloudflare on the detail page).
  • Anti-bot: Cloudflare on the search page → use headed Chromium under Xvfb (xvfb-run wrapper in Docker entrypoint).
  • Volume cadence: 120 min.
  • Quirks: thin tile data → score conservatively until detail-fetch lands.

Consortia

  • URL pattern: https://www.consortia.com/jobs/?rcf_id=<filter_ids>&jobtype=<perm|contract>&searchid=<id>&loc_option=<remote|...>.
  • Tile contents: title, location, salary range, brief snippet (~100-200 chars), job ID.
  • Detail fetch: generic HTTP fetcher (detail_fetcher.py) on the detail URL.
  • Anti-bot: none for browse — public board.
  • Render: JS-hydrated tiles. We wait 5s after load, then parse .jobContainer DOM.
  • Volume cadence: 120 min.
  • Quirks: PM-specialist agency, so lane fit is usually high. Filled vacancies tend to disappear from search results (no special handling needed).

IntelligentPeople

  • URL pattern: https://www.intelligentpeople.co.uk/jobs/?_sft_job_discipline=product-management&_sft_job_county=<remote|remote-england>&.... Hybrid via _sf_s=hybrid param.
  • Tile contents: title, salary range, location, snippet, job type (perm / contract).
  • Detail fetch: generic HTTP fetcher.
  • Anti-bot: none for browse.
  • Render: JS-hydrated, similar to Consortia.
  • Special handling: scraper SKIPS filled vacancies automatically (detects .expired--message or expired class on .job-card--body).
  • Volume cadence: 120 min.
  • Quirks: PM-specialist. Filled vacancies are visible on the public page but greyed out — we skip them.

Generic engine (one engine, per-site YAML config)

  • Sites currently configured: CWJobs, CV-Library, Built In London, Hays Digital Technology, Robert Walters, Michael Page Technology, Sphere Digital Recruitment, Cranberry Panda, Jobserve.
  • Render mode: per-site (http or browser). Sites with JS-rendered job lists use browser (patchright); SSR sites use http.
  • Tile contents: varies wildly by site. Some have full descriptions; most have snippets only.
  • Detail fetch: generic HTTP fetcher on the linked detail page.
  • Anti-bot: varies. Cloudflare-blocked sites (Robert Walters, CV-Library) require browser mode. Others fetch fine with http.
  • Per-site browser profile: generic configs can specify profile: <name> to use a dedicated browser profile (e.g. for welcometothejungle once login is seeded).
  • Cadence: 240 min default — these are lower-priority than the dedicated agency boards.
  • Silent-breakage canary: the engine logs WARNING when a site returns 0 jobs unexpectedly — paired with search_runs.jobs_found tracking we can spot DOM drift.
  • Quirks: selectors are best-guess starting points; if a site returns 0 unexpectedly, dig into its DOM and refine the per-site selector config.
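
The silent-breakage canary could be approximated as below. The catalogue does not define what counts as "unexpectedly" zero, so this sketch assumes one plausible rule: the current run found nothing while at least one recent run (for example, recent search_runs.jobs_found values) found jobs. The function names and logger name are hypothetical.

```python
import logging
from typing import Sequence

logger = logging.getLogger("generic_engine")  # hypothetical logger name


def zero_is_unexpected(jobs_found: int, recent_counts: Sequence[int]) -> bool:
    """True when this run found nothing but a recent run found jobs."""
    return jobs_found == 0 and any(count > 0 for count in recent_counts)


def check_canary(site: str, jobs_found: int, recent_counts: Sequence[int]) -> bool:
    """Log a WARNING and return True when the zero looks like DOM drift."""
    fired = zero_is_unexpected(jobs_found, recent_counts)
    if fired:
        logger.warning(
            "%s returned 0 jobs (recent runs: %s); possible DOM drift",
            site,
            list(recent_counts),
        )
    return fired
```

A site that has always returned 0 does not fire under this rule, which avoids noise from a brand-new config but also means a selector that never worked stays quiet.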

Welcome to the Jungle (PLANNED)

  • Status: scraper not built. Awaiting (a) login profile seeded via browser_login.py welcometothejungle, (b) a search URL from the user (the URL provided so far is a single-job link).
  • Render: JS-rendered SPA, login-required for jobs view.
  • Plan: use the generic engine with render: browser and profile: welcometothejungle.
  • Profile context: Jonny's WTTJ profile is at 84% completion; AI/Claude/vibe-coding angle missing from current bullets — separate task to update profile content.

Sufficiency gate & detail-fetch routing

Every job in the DB must pass _is_description_sufficient() in scripts/score_jobs.py BEFORE scoring. The check:

  1. Length ≥ 500 chars.
  2. Doesn't end with a truncation marker (read more, see more, show more, view more, load more, ...).
  3. Ends with terminal punctuation (.!?")]}).
  4. Contains a structural signal (a bullet, or words like requirements / responsibilities / about / experience / qualifications / we offer / you'll / etc).
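
The four checks can be sketched as follows. This is a reading of the rules listed here, not the actual body of _is_description_sufficient(): the constant names, the bullet regex, and the simple substring match on keywords are assumptions, and the keyword list contains only the words named in this catalogue.

```python
import re

MIN_LENGTH = 500
TRUNCATION_MARKERS = (
    "read more", "see more", "show more", "view more", "load more", "...",
)
TERMINAL_PUNCTUATION = '.!?")]}'
STRUCTURE_WORDS = (
    "requirements", "responsibilities", "about", "experience",
    "qualifications", "we offer", "you'll",
)
# Assumed bullet shapes: a line starting with "•", "-" or "*" plus a space.
BULLET_RE = re.compile(r"^\s*[•\-*]\s+", re.MULTILINE)


def is_description_sufficient(text: str) -> bool:
    """Apply the four sufficiency checks in order (illustrative sketch)."""
    body = (text or "").strip()
    lowered = body.lower()
    if len(body) < MIN_LENGTH:
        return False
    # Checked before terminal punctuation, because "..." also ends in ".".
    if lowered.endswith(TRUNCATION_MARKERS):
        return False
    if body[-1] not in TERMINAL_PUNCTUATION:
        return False
    has_bullet = bool(BULLET_RE.search(body))
    has_keyword = any(word in lowered for word in STRUCTURE_WORDS)
    return has_bullet or has_keyword
```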

If any check fails → fetch via the source-appropriate fetcher:

  • linkedin → LinkedInScraper.fetch_detail(url)
  • indeed → IndeedScraper.fetch_description(url) (via FlareSolverr)
  • everything else → detail_fetcher.fetch_description_http(url) (plain httpx + body extraction)

After fetch, re-check sufficiency. If still insufficient → increment score_attempts. After 5 failed attempts → mark status='unscorable' and surface in dashboard for paste-and-rescore.
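
The routing and retry rules above can be sketched as one function. Everything here is illustrative: `enrich_job`, the dict-shaped job record, and the injected callables are hypothetical stand-ins for the real fetchers and DB rows; only the per-source routing, the `score_attempts` counter, the limit of 5, and `status='unscorable'` come from this section.

```python
from typing import Callable, Dict

MAX_SCORE_ATTEMPTS = 5

Fetcher = Callable[[str], str]


def enrich_job(
    job: dict,
    fetchers: Dict[str, Fetcher],
    default_fetcher: Fetcher,
    is_sufficient: Callable[[str], bool],
) -> dict:
    """Fetch a fuller description if needed and track failed attempts."""
    if is_sufficient(job.get("description", "")):
        return job  # already good enough to score
    # Route by source; anything without a dedicated fetcher uses plain HTTP.
    fetch = fetchers.get(job["source"], default_fetcher)
    text = fetch(job["url"]) or ""
    if is_sufficient(text):
        job["description"] = text
        return job
    job["score_attempts"] = job.get("score_attempts", 0) + 1
    if job["score_attempts"] >= MAX_SCORE_ATTEMPTS:
        job["status"] = "unscorable"  # surfaced for paste-and-rescore
    return job
```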


How to add a new source

  1. Easy path (use the generic engine): add a YAML block under generic: in config/searches.yaml with the per-site config (URL, render mode, container selector, title/link/etc selectors, cadence, url_prefix, country_default). Test by running the scraper for that source once. Refine the container selector if the canary fires "0 jobs".
  2. Custom path (when generic doesn't fit): copy src/scrapers/consortia.py as a template, build a per-source class with fetch_jobs() + fetch_many(), register in scripts/run_scrape.py:SCRAPER_CLASSES, add the YAML section to searches.yaml.
  3. Always update this catalogue with the new source's nuances.
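
As a rough picture of the custom path, a per-source class might take the shape below. This is a hypothetical skeleton, not a copy of src/scrapers/consortia.py: the `Job` dataclass, the class and source names, the method signatures, and the de-duplication by source_id inside fetch_many() are all assumptions, and the real template should be treated as authoritative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Job:
    """Hypothetical job record; the real model may differ."""
    source: str
    source_id: str
    title: str
    url: str
    description: str = ""


class ExampleBoardScraper:
    """Hypothetical custom scraper exposing fetch_jobs() and fetch_many()."""

    source = "exampleboard"

    def fetch_jobs(self, search_url: str) -> List[Job]:
        # A real scraper would load search_url and parse the job tiles here.
        raise NotImplementedError

    def fetch_many(self, search_urls: Iterable[str]) -> List[Job]:
        # Run every configured search and keep the first copy of each job.
        seen: Dict[str, Job] = {}
        for url in search_urls:
            for job in self.fetch_jobs(url):
                seen.setdefault(job.source_id, job)
        return list(seen.values())


# Stand-in for the registry in scripts/run_scrape.py (shape assumed).
SCRAPER_CLASSES = {ExampleBoardScraper.source: ExampleBoardScraper}
```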

Test methodology when a source breaks

  • Check search_runs table for the latest jobs_found count for the source — silent zeroes mean DOM drift.
  • Run the scraper manually: docker exec jobhunt sh -c "cd /app && xvfb-run -a /app/.venv/bin/python scripts/run_scrape.py --force" (or --source <name> if implemented).
  • Inspect saved debug HTML files in strategy/upwork/ or /tmp/ for selector mismatches.
  • For LinkedIn: check for /authwall redirect language in the response HTML.
  • For Cloudflare-protected sites: check FlareSolverr health (curl http://localhost:8191/v1 should return JSON).