Crawlee Scraper

Primary scraper. File: src/enrichment/sources/crawlee.ts (~500 lines). Activated with USE_CRAWLEE=true.

Crawlee wraps Playwright and handles the crawl loop. It is required because many Swedish SME sites render their content client-side (React apps, Wix sites), so a plain cheerio parse of the static HTML misses it.

Crawl behaviour

  • Up to 12 pages per company
  • Prioritises: /kontakt, /om-oss, /team, /personal, /medarbetare, /about
  • Skips: /blogg, /nyheter, /karriar, /produkt, /priser
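
The page-selection rules above can be sketched as a small pure function. This is an illustration only, not the code in crawlee.ts: planCrawl, the constant names, prefix matching via startsWith, and keeping discovery order within each group are all assumptions.

```typescript
// Hypothetical sketch of the crawl-planning rules; names are not from crawlee.ts.
const PRIORITY_PATHS = ["/kontakt", "/om-oss", "/team", "/personal", "/medarbetare", "/about"];
const SKIP_PATHS = ["/blogg", "/nyheter", "/karriar", "/produkt", "/priser"];
const MAX_PAGES = 12;

// Takes URL pathnames discovered on a company site and returns the ones to visit:
// skipped sections removed, priority sections first, capped at MAX_PAGES.
function planCrawl(paths: string[]): string[] {
  const seen = new Set<string>();
  const prioritised: string[] = [];
  const rest: string[] = [];

  for (const raw of paths) {
    const path = raw.toLowerCase();
    if (seen.has(path)) continue; // drop duplicate links
    seen.add(path);

    // Assumption: a path belongs to a section if it starts with that prefix.
    if (SKIP_PATHS.some((p) => path.startsWith(p))) continue;
    if (PRIORITY_PATHS.some((p) => path.startsWith(p))) {
      prioritised.push(path);
    } else {
      rest.push(path);
    }
  }

  return [...prioritised, ...rest].slice(0, MAX_PAGES);
}
```

With the input ["/", "/blogg/nytt", "/kontakt", "/tjanster", "/om-oss"], this sketch visits /kontakt and /om-oss first, then / and /tjanster, and never queues the blog page.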

Six extraction strategies (run in parallel per page)

  1. DOM extraction: page.evaluate() finds containers with CSS classes person|member|employee|staff|team|contact|people|colleague, extracts name/role/email/phone from sub-elements.
  2. Text extraction — splits page plain text into lines, accepts lines passing isValidPersonName(), pulls role from next line, scans ±150-char window for email and phone.
  3. Email local-part extraction — derives names from emails. kristofer@snoggy.se → “Kristofer”. anna.svensson@company.se → “Anna Svensson”. Filters generic prefixes (info, support, hej, kontakt).
  4. Swedish prose patterns — regex like /Kontakta\s+([A-ZÅÄÖ]...)/, /VD:\s*(...)/, /Ansvarig:\s*(...)/.
  5. Image alt text — scans <img alt>. Single-word alts must match ^[\p{Lu}][\p{Ll}]{2,}$ (rejects IMG_3245, Photo1).
  6. JSON-LD — see JSON-LD Extraction. Single biggest gain in autoresearch (+18.7 composite points vs baseline).
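
Strategies 3 and 5 are specified tightly enough to sketch in isolation. The snippet below is a minimal illustration, not the production code: nameFromEmail and isPlausibleSingleWordAlt are hypothetical names, the generic list holds only the four prefixes named above, and splitting the local part on ".", "_" and "-" is an assumption.

```typescript
// Hypothetical sketch of strategies 3 and 5; function names are illustrative.
const GENERIC_PREFIXES = new Set(["info", "support", "hej", "kontakt"]);

// Built with the RegExp constructor so Unicode property escapes work
// regardless of the compile target.
const LETTERS_ONLY = new RegExp("^\\p{L}+$", "u");
const SINGLE_WORD_ALT = new RegExp("^[\\p{Lu}][\\p{Ll}]{2,}$", "u");

function capitalise(part: string): string {
  return part.charAt(0).toUpperCase() + part.slice(1).toLowerCase();
}

// Strategy 3: derive a display name from the local part of an email address.
// Returns null for generic mailboxes and for local parts that contain non-letters.
function nameFromEmail(email: string): string | null {
  const at = email.indexOf("@");
  if (at <= 0) return null;

  const local = email.slice(0, at).toLowerCase();
  if (GENERIC_PREFIXES.has(local)) return null;

  // Assumption: name parts are separated by ".", "_" or "-".
  const parts = local.split(/[._-]/).filter((p) => p.length > 0);
  if (parts.length === 0 || !parts.every((p) => LETTERS_ONLY.test(p))) return null;

  return parts.map(capitalise).join(" ");
}

// Strategy 5: a single-word alt text must be one uppercase letter followed by
// at least two lowercase letters, which rejects camera filenames and placeholders.
function isPlausibleSingleWordAlt(alt: string): boolean {
  return SINGLE_WORD_ALT.test(alt.trim());
}
```

In this sketch, kristofer@snoggy.se yields "Kristofer", anna.svensson@company.se yields "Anna Svensson", and info@company.se is dropped. Any name produced this way would still need to pass Name Validation.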

All six merge and deduplicate into one contact list. Every name passes Name Validation before acceptance.
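
A rough sketch of the merge step follows. How crawlee.ts actually keys duplicates is not documented here, so the choices below are assumptions: contacts are keyed by normalised name, the first strategy to report a field wins, and the Contact shape and mergeContacts name are hypothetical. The validator is passed in as a stand-in for Name Validation.

```typescript
// Hypothetical sketch of merging per-strategy results into one contact list.
interface Contact {
  name: string;
  role?: string;
  email?: string;
  phone?: string;
}

type NameValidator = (name: string) => boolean;

// Assumption: two contacts are the same person if their names match after
// trimming, collapsing whitespace and lowercasing.
function normaliseName(name: string): string {
  return name.trim().replace(/\s+/g, " ").toLowerCase();
}

function mergeContacts(batches: Contact[][], isValidName: NameValidator): Contact[] {
  const byName = new Map<string, Contact>();

  for (const batch of batches) {
    for (const contact of batch) {
      // Every name must pass validation before it is accepted.
      if (!isValidName(contact.name)) continue;

      const key = normaliseName(contact.name);
      const existing = byName.get(key);
      if (existing === undefined) {
        byName.set(key, { ...contact, name: contact.name.trim() });
        continue;
      }

      // Fill only the fields the earlier record is missing.
      if (existing.role === undefined && contact.role !== undefined) existing.role = contact.role;
      if (existing.email === undefined && contact.email !== undefined) existing.email = contact.email;
      if (existing.phone === undefined && contact.phone !== undefined) existing.phone = contact.phone;
    }
  }

  return Array.from(byName.values());
}
```

Keying on the name rather than the email lets a record found by DOM extraction (name and role) absorb an email found for the same person by another strategy. The trade-off is that two different people with identical names on one site would collapse into one entry.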

Known wins from autoresearch

Skip when

Set USE_FIRECRAWL=true to use the LLM-based Firecrawl extractor instead. Phase 2 A/B comparison not yet run — don’t switch defaults blindly.

See also

Name Validation, JSON-LD Extraction, EnrichV7, Firecrawl, Autoresearch Loop.