Data Sources Catalogue

Operational reference for every place we pull jobs from. For each source: what's in the tile, what needs a detail-fetch, auth requirements, anti-bot characteristics, latency, known quirks, and which scoring track applies.

Last updated: 2026-06-03.

Quick reference table

Source	File	Render	Auth needed	Description in tile?	Cadence	Scoring track
Upwork	`upwork.py`	SSR via patchright (logged-in profile)	Yes - Freelancer account	✅ full	15 min	UPWORK
LinkedIn	`linkedin.py`	guest-mode search + detail-fetch	No for search; detail fetches behind authwall	❌ tile-only (title + company + location + date)	120 min	NON-UPWORK
Indeed	`indeed.py`	via FlareSolverr sidecar	No (FlareSolverr passes Cloudflare)	✅ snippet, full description via detail fetch	60 min	NON-UPWORK
RemoteRocketship	`remoterocketship.py`	headed Chromium under Xvfb (Cloudflare)	No	🟡 thin (title + company + location + seniority + funding)	120 min	NON-UPWORK
Consortia	`consortia.py`	JS-rendered, settle-and-parse	No	🟡 snippet only	120 min	NON-UPWORK
IntelligentPeople	`intelligentpeople.py`	JS-rendered, skips filled	No	🟡 snippet, full salary + location	120 min	NON-UPWORK
Welcome to the Jungle	(planned)	JS-rendered, requires login	Yes - WTTJ account	❓ TBD	TBD	NON-UPWORK
Generic engine (CWJobs, CV-Library, Built In London, Hays, Robert Walters, Michael Page, Sphere, Cranberry Panda, Jobserve)	`generic.py`	per-site config: http or browser	Per-site	🟡 varies by config	240 min	NON-UPWORK
Detail-fetch fallback	`detail_fetcher.py`	httpx + BeautifulSoup	No	n/a (post-tile enrichment)	on-demand	n/a

Legend: ✅ full description / 🟡 partial or snippet / ❌ tile only / ❓ not yet determined

Per-source nuances

Upwork

URL pattern: https://www.upwork.com/nx/find-work/<topic_id> (NOT /nx/search/jobs/?topic_id=<id> — that's Cloudflare-protected and we can't pass it without JS).
Why we use find-work, not search/jobs: the /nx/find-work/ route returns 2 SSR'd job tiles per request and is NOT Cloudflare-protected. The /nx/search/jobs/ route is behind an interactive challenge that kills our session.
Profile required: Freelancer account (set during browser_login.py upwork - pick FREELANCER on the account-selection screen, never Client or Agency). The upwork-client profile exists separately for talent-search scraping (mining top freelancers).
Description quality: full, included in tile.
Anti-bot: mild on find-work; aggressive on search/jobs and talent-search. Cookies expire; re-run browser_login.py upwork if the scraper logs show it was served the account-selection screen.
Quirks: Upwork auto-appends "I am willing to pay higher rates for the most experienced freelancers" when the client picks Expert level. Treat it as boilerplate, not as the client's own voice, when scoring.
Per-search throttling: 15-min default cadence; some searches have low daily volume.

LinkedIn

URL pattern (search): https://www.linkedin.com/jobs/search/?... with geoId + keywords + f_WT=2 (remote) parameters. NOT /jobs/search-results/ (forces login).
URL pattern (detail): https://uk.linkedin.com/jobs/view/<slug>-<id>. Auth-walled if accessed too aggressively.
Tile contents: title, company, location, posted date. NO description.
Detail fetch: required for any meaningful scoring. Lives in linkedin.py:fetch_detail(). Returns {description, budget, criteria}.
Anti-bot: STRONG. Detail-page fetches sit behind a per-IP rate limit; aggressive fetching triggers an /authwall redirect. Defence:
30-60s jitter between detail fetches (in JITTER_BETWEEN_DETAILS_S).
Authwall latch: if /authwall is hit, the scraper sets self._authwalled=True and short-circuits all further detail fetches for the rest of the run.
Throwaway browser profile per LinkedIn request (no persisted cookies).
Volume cadence: 120 min default. Anything more aggressive risks /authwall on the IP.
Per-region searches: geoId 101165590 = United Kingdom, 105646813 = Spain, 105512687 = Valencia.
Quirks:
"Easy Apply" tag = higher volume / lower quality.
Listing tile doesn't include description; if our detail-fetch is in throttle backoff, the job stays in the unscored queue until the next cycle.
No cap on detail-fetches per run as of 2026-06-03 (the old --max-li-details 30 was removed). Rely on jitter + authwall-latch.
Backfill: scripts/backfill_descriptions.py --source linkedin for historical empty descriptions.
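The jitter and authwall latch described above can be sketched as a small wrapper around the detail fetch. This is a minimal illustration, not the code in linkedin.py: the `DetailFetchGate` class, the `fetch` callable returning `(final_url, html)`, and the tuple shape of `JITTER_BETWEEN_DETAILS_S` are all assumptions.

```python
import random
import time
from typing import Callable, Optional

# Assumed shape: the catalogue only says the jitter is 30-60 s and is
# configured in JITTER_BETWEEN_DETAILS_S.
JITTER_BETWEEN_DETAILS_S = (30.0, 60.0)


class DetailFetchGate:
    """Jitter plus authwall latch around a detail-fetch callable (illustrative)."""

    def __init__(self, fetch: Callable[[str], tuple[str, str]],
                 sleep: Callable[[float], None] = time.sleep) -> None:
        # `fetch` is hypothetical: it returns (final_url, html) for a job URL.
        self._fetch = fetch
        self._sleep = sleep
        self._authwalled = False
        self._fetched_once = False

    def fetch_detail(self, url: str) -> Optional[str]:
        if self._authwalled:
            return None  # latch is set: skip every remaining detail fetch this run
        if self._fetched_once:
            # Random pause between consecutive detail fetches, never before the first.
            self._sleep(random.uniform(*JITTER_BETWEEN_DETAILS_S))
        self._fetched_once = True
        final_url, html = self._fetch(url)
        if "/authwall" in final_url:
            self._authwalled = True  # one hit trips the latch for the rest of the run
            return None
        return html
```

Injecting `sleep` keeps the gate testable without waiting out real pauses. Jobs skipped by the latch simply stay unscored until the next cycle, which matches the behaviour noted under Quirks.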

Indeed

URL pattern: https://uk.indeed.com/jobs?q=title%3A%22<term>%22&l=&sc=0kf%3Aattr(DSQF7)%3B&from=searchOnDesktopSerp for UK remote; https://es.indeed.com/jobs?... for Spain.
Tile contents: title, company, location, salary if present, brief snippet.
Detail fetch: indeed.py:fetch_description(). Returns full body text.
Anti-bot: Cloudflare Turnstile. Our defence is a FlareSolverr sidecar container (FLARESOLVERR_URL env var, default http://flaresolverr:8191/v1). Each fetch takes 5-15s because FlareSolverr's own browser waits out the challenge.
Volume cadence: 60 min default (longer than Upwork due to slow FlareSolverr cycle).
Quirks:
Indeed silently changes filter URLs occasionally — if a source returns 0 over several runs, check the saved URL still resolves.
UK Indeed has many duplicate listings (agencies re-posting); dedup by source_id handles most.
Spain Indeed → only score English-language ads per criteria.
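For orientation, a fetch through the FlareSolverr sidecar might look like the sketch below. It uses FlareSolverr's `request.get` command over plain `urllib`; the helper names (`build_payload`, `extract_html`, `fetch_via_flaresolverr`) and the timeout values are illustrative assumptions, not what indeed.py actually contains.

```python
import json
import os
import urllib.request

# Env var name and default are the ones documented in this section.
FLARESOLVERR_URL = os.environ.get("FLARESOLVERR_URL", "http://flaresolverr:8191/v1")


def build_payload(url: str, max_timeout_ms: int = 60_000) -> dict:
    """Request body for FlareSolverr's `request.get` command."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def extract_html(reply: dict) -> str:
    """Return the solved page HTML, or raise if FlareSolverr reported a failure."""
    if reply.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr error: {reply.get('message', 'unknown')}")
    return reply["solution"]["response"]


def fetch_via_flaresolverr(url: str, timeout_s: float = 90.0) -> str:
    """POST the target URL to the sidecar and return the rendered HTML."""
    request = urllib.request.Request(
        FLARESOLVERR_URL,
        data=json.dumps(build_payload(url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The HTTP timeout is kept above maxTimeout so the sidecar can finish the challenge.
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return extract_html(json.load(resp))
```

Splitting payload construction and response parsing from the network call means both can be checked without a running sidecar.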

RemoteRocketship

URL pattern: https://www.remoterocketship.com/country/<slug>/jobs/<role-slug>/?...&isOnLinkedIn=false.
Why the isOnLinkedIn=false filter: it restricts results to jobs that are NOT on LinkedIn, so this source doesn't duplicate what our LinkedIn searches already cover.
Tile contents: title, company, location, seniority, employment type, funding stage. NO full description.
Detail fetch: can use detail_fetcher.fetch_description_http() (plain HTTP - no auth wall, no Cloudflare on the detail page).
Anti-bot: Cloudflare on the search page → use headed Chromium under Xvfb (xvfb-run wrapper in the Docker entrypoint).
Volume cadence: 120 min.
Quirks: thin tile data → score conservatively until detail-fetch lands.

Consortia

URL pattern: https://www.consortia.com/jobs/?rcf_id=<filter_ids>&jobtype=<perm|contract>&searchid=<id>&loc_option=<remote|...>.
Tile contents: title, location, salary range, brief snippet (~100-200 chars), job ID.
Detail fetch: generic HTTP fetcher (detail_fetcher.py) on the detail URL.
Anti-bot: none for browse — public board.
Render: JS-hydrated tiles. We wait 5s after load, then parse .jobContainer DOM.
Volume cadence: 120 min.
Quirks: PM-specialist agency, so lane fit is usually high. Filled vacancies tend to disappear from search results (no special handling needed).

IntelligentPeople

URL pattern: https://www.intelligentpeople.co.uk/jobs/?_sft_job_discipline=product-management&_sft_job_county=<remote|remote-england>&.... Hybrid via _sf_s=hybrid param.
Tile contents: title, salary range, location, snippet, job type (perm / contract).
Detail fetch: generic HTTP fetcher.
Anti-bot: none for browse.
Render: JS-hydrated, similar to Consortia.
Special handling: scraper SKIPS filled vacancies automatically (detects .expired--message or expired class on .job-card--body).
Volume cadence: 120 min.
Quirks: PM-specialist. Filled vacancies are visible on the public page but greyed out — we skip them.
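The filled-vacancy skip can be illustrated with a standard-library parser. This sketch assumes one reading of the rule above: a tile counts as filled if any element carries the `expired--message` class, or if the `.job-card--body` element also carries `expired`. The `is_filled` helper and the sample markup in the usage lines are hypothetical.

```python
from html.parser import HTMLParser


class _FilledDetector(HTMLParser):
    """Flags a tile as filled using the two class markers named above."""

    def __init__(self) -> None:
        super().__init__()
        self.filled = False

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if "expired--message" in classes:
            self.filled = True
        if "job-card--body" in classes and "expired" in classes:
            self.filled = True


def is_filled(tile_html: str) -> bool:
    """Return True when the tile markup shows either expired marker."""
    detector = _FilledDetector()
    detector.feed(tile_html)
    return detector.filled


# Hypothetical tiles, for illustration only.
open_tile = '<div class="job-card--body"><h3>Senior PM</h3></div>'
filled_tile = '<div class="job-card--body expired"><h3>Head of Product</h3></div>'
live_tiles = [t for t in (open_tile, filled_tile) if not is_filled(t)]
```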

Generic engine (one engine, per-site YAML config)

Sites currently configured: CWJobs, CV-Library, Built In London, Hays Digital Technology, Robert Walters, Michael Page Technology, Sphere Digital Recruitment, Cranberry Panda, Jobserve.
Render mode: per-site (http or browser). Sites with JS-rendered job lists use browser (patchright); SSR sites use http.
Tile contents: varies wildly by site. Some have full descriptions; most have snippets only.
Detail fetch: generic HTTP fetcher on the linked detail page.
Anti-bot: varies. Cloudflare-blocked sites (Robert Walters, CV-Library) require browser mode. Others fetch fine with http.
Per-site browser profile: generic configs can specify profile: <name> to use a dedicated browser profile (e.g. for welcometothejungle once login is seeded).
Cadence: 240 min default — these are lower-priority than the dedicated agency boards.
Silent-breakage canary: the engine logs a WARNING when a site returns 0 jobs unexpectedly. Paired with search_runs.jobs_found tracking, this lets us spot DOM drift.
Quirks: selectors are best-guess starting points; if a site returns 0 unexpectedly, dig into its DOM and refine the per-site selector config.
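The catalogue does not define what makes a zero "unexpected", so the sketch below picks one plausible rule: warn when the current run finds nothing but the recent `search_runs.jobs_found` history for that site has a non-zero median. The function names, the `min_runs` threshold, and the logger name are assumptions.

```python
import logging
from statistics import median

log = logging.getLogger("generic_engine")  # hypothetical logger name


def zero_is_unexpected(history: list[int], min_runs: int = 3) -> bool:
    """True when earlier runs usually found jobs, so a zero now looks like DOM drift.

    `history` holds recent jobs_found values for one site, excluding the
    current run. `min_runs` is an assumed threshold to avoid noisy warnings.
    """
    return len(history) >= min_runs and median(history) > 0


def check_canary(site: str, jobs_found: int, history: list[int]) -> bool:
    """Log a WARNING and return True when the canary fires."""
    if jobs_found == 0 and zero_is_unexpected(history):
        log.warning("%s returned 0 jobs (recent median %s): possible DOM drift",
                    site, median(history))
        return True
    return False
```

Using the median rather than the last run means a single earlier zero does not silence the canary for a site that is normally productive.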

Welcome to the Jungle (PLANNED)

Status: scraper not built. Awaiting (a) login profile seeded via browser_login.py welcometothejungle, (b) a search URL from the user (the URL provided so far is a single-job link).
Render: JS-rendered SPA, login-required for jobs view.
Plan: use the generic engine with render: browser and profile: welcometothejungle.
Profile context: Jonny's WTTJ profile is at 84% completion; AI/Claude/vibe-coding angle missing from current bullets — separate task to update profile content.

Sufficiency gate & detail-fetch routing

Every job in the DB must pass _is_description_sufficient() in scripts/score_jobs.py BEFORE scoring. The check:
1. Length ≥ 500 chars.
2. Doesn't end with a truncation marker (read more, see more, show more, view more, load more, ..., …).
3. Ends with terminal punctuation (.!?")]}).
4. Contains a structural signal (a bullet, OR words like requirements / responsibilities / about / experience / qualifications / we offer / you'll / etc).
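The four checks can be sketched as a single function. This is an approximation of `_is_description_sufficient()`, not a copy of it: the keyword list contains only the words spelled out above (the real list is longer), and the bullet pattern is an assumption.

```python
import re

MIN_LENGTH = 500
TRUNCATION_MARKERS = ("read more", "see more", "show more", "view more",
                      "load more", "...", "…")
TERMINAL_PUNCTUATION = '.!?")]}'
# Only the keywords named in the text; the real list continues ("etc").
STRUCTURE_WORDS = re.compile(
    r"\b(requirements|responsibilities|about|experience|qualifications|we offer|you'll)\b",
    re.IGNORECASE,
)
# Assumed bullet styles: a line starting with -, * or •.
BULLET_LINE = re.compile(r"^\s*[-*•]\s+", re.MULTILINE)


def is_description_sufficient(text: str) -> bool:
    body = (text or "").strip()
    if len(body) < MIN_LENGTH:                       # check 1: length
        return False
    if body.lower().endswith(TRUNCATION_MARKERS):    # check 2: truncation marker
        return False
    if body[-1] not in TERMINAL_PUNCTUATION:         # check 3: terminal punctuation
        return False
    # check 4: at least one structural signal
    return bool(BULLET_LINE.search(body) or STRUCTURE_WORDS.search(body))
```

The order matters: a description ending in "..." also ends in a full stop, so the truncation check has to run before the terminal-punctuation check.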

If any check fails → fetch via the source-appropriate fetcher:
- linkedin → LinkedInScraper.fetch_detail(url)
- indeed → IndeedScraper.fetch_description(url) (via FlareSolverr)
- everything else → detail_fetcher.fetch_description_http(url) (plain httpx + body extraction)

After fetch, re-check sufficiency. If still insufficient → increment score_attempts. After 5 failed attempts → mark status='unscorable' and surface in dashboard for paste-and-rescore.
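The routing and retry rules above can be sketched as follows. The job is modelled as a plain dict and the fetchers as a mapping of callables that return description text; both are assumptions for illustration (the real LinkedIn fetcher returns a richer structure), as are the names `pick_fetcher` and `ensure_description`.

```python
from typing import Callable, Optional

MAX_SCORE_ATTEMPTS = 5  # from the text above

Fetcher = Callable[[str], Optional[str]]


def pick_fetcher(source: str, fetchers: dict[str, Fetcher]) -> Fetcher:
    """linkedin and indeed get their own fetcher; everything else uses 'http'."""
    return fetchers.get(source, fetchers["http"])


def ensure_description(job: dict, fetchers: dict[str, Fetcher],
                       is_sufficient: Callable[[str], bool]) -> bool:
    """Return True when the job is ready to score; otherwise record the failed attempt."""
    if is_sufficient(job.get("description") or ""):
        return True
    fetched = pick_fetcher(job["source"], fetchers)(job["url"])
    if fetched:
        job["description"] = fetched
    if is_sufficient(job.get("description") or ""):
        return True
    # Still insufficient after the fetch: count the attempt, give up at the limit.
    job["score_attempts"] = job.get("score_attempts", 0) + 1
    if job["score_attempts"] >= MAX_SCORE_ATTEMPTS:
        job["status"] = "unscorable"
    return False
```

In this sketch the `fetchers` mapping would hold thin wrappers around `LinkedInScraper.fetch_detail`, `IndeedScraper.fetch_description`, and `detail_fetcher.fetch_description_http` under the keys `linkedin`, `indeed`, and `http`.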

How to add a new source

Easy path (use the generic engine): add a YAML block under generic: in config/searches.yaml with the per-site config (URL, render mode, container selector, title/link/etc selectors, cadence, url_prefix, country_default). Test by running the scraper for that source once. Refine the container selector if the canary fires "0 jobs".
Custom path (when generic doesn't fit): copy src/scrapers/consortia.py as a template, build a per-source class with fetch_jobs() + fetch_many(), register in scripts/run_scrape.py:SCRAPER_CLASSES, add the YAML section to searches.yaml.
Always update this catalogue with the new source's nuances.
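For the custom path, a skeleton along these lines may help as a starting point. Everything here is illustrative: the `Job` record, the `exampleboard` name, the method signatures, and the dict shape of `SCRAPER_CLASSES` are assumptions, so check src/scrapers/consortia.py and scripts/run_scrape.py for the real contracts.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record shape; the real job model is not shown in this catalogue.
    source: str
    source_id: str
    title: str
    url: str
    description: str = ""


class ExampleBoardScraper:
    """Skeleton for a dedicated scraper; 'exampleboard' is a placeholder name."""

    source = "exampleboard"

    def fetch_jobs(self, search_url: str) -> list[Job]:
        """Fetch one search URL and return its tiles as Job records."""
        raise NotImplementedError("load the page and parse tiles here")

    def fetch_many(self, search_urls: list[str]) -> list[Job]:
        """Run every search and drop repeats by source_id, keeping the first seen."""
        seen: dict[str, Job] = {}
        for search_url in search_urls:
            for job in self.fetch_jobs(search_url):
                seen.setdefault(job.source_id, job)
        return list(seen.values())


# Assumed shape of the registry in scripts/run_scrape.py: source name -> class.
SCRAPER_CLASSES = {ExampleBoardScraper.source: ExampleBoardScraper}
```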

Test methodology when a source breaks

Check search_runs table for the latest jobs_found count for the source — silent zeroes mean DOM drift.
Run the scraper manually: docker exec jobhunt sh -c "cd /app && xvfb-run -a /app/.venv/bin/python scripts/run_scrape.py --force" (or --source <name> if implemented).
Inspect saved debug HTML files in strategy/upwork/ or /tmp/ for selector mismatches.
For LinkedIn: check for /authwall redirect language in the response HTML.
For sources that go through FlareSolverr (currently Indeed): check FlareSolverr health (curl http://localhost:8191/ should return a JSON readiness message; the /v1 endpoint itself expects POST requests).