Crawlee Scraper
Primary scraper. File: src/enrichment/sources/crawlee.ts (~500 lines). Activated with USE_CRAWLEE=true.
Crawlee wraps Playwright and handles the crawl loop. Required because many Swedish SME sites render client-side via JS frameworks or site builders (React, Wix); plain cheerio only parses the static HTML and misses their content.
Crawl behaviour
- Up to 12 pages per company
- Prioritises: /kontakt, /om-oss, /team, /personal, /medarbetare, /about
- Skips: /blogg, /nyheter, /karriar, /produkt, /priser
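The prioritise/skip rules and the 12-page cap can be sketched as a pure ordering function. This is a minimal illustration, not the code in crawlee.ts: planCrawl and pathOf are hypothetical names, and matching on the path prefix is an assumption (the real scraper may match anywhere in the path).

```typescript
// Path lists and page cap taken from the notes above.
const PRIORITY_PATHS = ["/kontakt", "/om-oss", "/team", "/personal", "/medarbetare", "/about"];
const SKIP_PATHS = ["/blogg", "/nyheter", "/karriar", "/produkt", "/priser"];
const MAX_PAGES_PER_COMPANY = 12;

// Hypothetical helper: lower-cased pathname, or null if the URL cannot be parsed.
function pathOf(url: string): string | null {
  try {
    return new URL(url).pathname.toLowerCase();
  } catch {
    return null;
  }
}

// Hypothetical planner: drops skipped paths, puts priority paths first,
// and caps the result at MAX_PAGES_PER_COMPANY.
function planCrawl(urls: string[]): string[] {
  const prioritised: string[] = [];
  const others: string[] = [];
  const seen = new Set<string>();

  for (const url of urls) {
    const path = pathOf(url);
    if (path === null || seen.has(url)) continue;
    seen.add(url);

    // Assumption: a rule matches when the path starts with the fragment.
    if (SKIP_PATHS.some((p) => path.startsWith(p))) continue;
    if (PRIORITY_PATHS.some((p) => path.startsWith(p))) {
      prioritised.push(url);
    } else {
      others.push(url);
    }
  }

  return prioritised.concat(others).slice(0, MAX_PAGES_PER_COMPANY);
}
```

With this ordering, contact and team pages are fetched before generic pages, so the page budget is spent where people are most likely to be listed.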
Six extraction strategies (run in parallel per page)
- DOM extraction: page.evaluate() finds containers with CSS classes person|member|employee|staff|team|contact|people|colleague, then extracts name/role/email/phone from sub-elements.
- Text extraction: splits the page's plain text into lines, accepts lines passing isValidPersonName(), pulls the role from the next line, and scans a ±150-char window for email and phone.
- Email local-part extraction: derives names from emails. kristofer@snoggy.se → “Kristofer”. anna.svensson@company.se → “Anna Svensson”. Filters generic prefixes (info, support, hej, kontakt).
- Swedish prose patterns: regex like /Kontakta\s+([A-ZÅÄÖ]...)/, /VD:\s*(...)/, /Ansvarig:\s*(...)/.
- Image alt text: scans <img alt>. Single-word alts must match ^[\p{Lu}][\p{Ll}]{2,}$ (rejects IMG_3245, Photo1).
- JSON-LD: see JSON-LD Extraction. Single biggest gain in autoresearch (+18.7 composite points vs baseline).
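As an illustration of the email local-part strategy, the sketch below derives a display name from an address. The function name nameFromEmail is hypothetical; splitting on "." and "_" and rejecting parts that contain digits or other non-letters are assumptions, not confirmed behaviour of crawlee.ts. Only the generic prefixes come from the notes.

```typescript
// Generic mailbox prefixes listed in the notes; these never yield a person name.
const GENERIC_PREFIXES = new Set(["info", "support", "hej", "kontakt"]);

function capitalise(part: string): string {
  return part.charAt(0).toUpperCase() + part.slice(1).toLowerCase();
}

// Hypothetical helper: returns a candidate name, or null if none can be derived.
function nameFromEmail(email: string): string | null {
  const at = email.indexOf("@");
  if (at <= 0) return null; // no "@" or empty local part

  const local = email.slice(0, at).toLowerCase();
  if (GENERIC_PREFIXES.has(local)) return null;

  // Assumption: "." and "_" separate first name and surname.
  const parts = local.split(/[._]+/).filter((p) => p.length > 0);

  // Assumption: every part must consist of letters only (incl. å, ä, ö, é).
  if (parts.length === 0 || parts.some((p) => !/^[a-zåäöé]+$/.test(p))) {
    return null;
  }

  return parts.map(capitalise).join(" ");
}

console.log(nameFromEmail("kristofer@snoggy.se"));      // Kristofer
console.log(nameFromEmail("anna.svensson@company.se")); // Anna Svensson
console.log(nameFromEmail("info@company.se"));          // null
```

A candidate produced this way is only a guess; it would still have to pass Name Validation like every other extracted name.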
All six merge and deduplicate into one contact list. Every name passes Name Validation before acceptance.
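A minimal sketch of the merge step, assuming two contacts are the same person when their normalised names match and that the first non-empty value wins for each field. The Contact shape and mergeContacts are hypothetical; the validator is passed in as a stand-in for the Name Validation check, whose real rules are documented separately.

```typescript
// Hypothetical contact shape; the real type in crawlee.ts may differ.
interface Contact {
  name: string;
  role?: string;
  email?: string;
  phone?: string;
}

// Assumption: identity is the trimmed, lower-cased, whitespace-collapsed name.
function dedupeKey(contact: Contact): string {
  return contact.name.trim().toLowerCase().replace(/\s+/g, " ");
}

// Merges the per-strategy result lists into one deduplicated list.
function mergeContacts(
  strategyResults: Contact[][],
  isValidName: (name: string) => boolean,
): Contact[] {
  const byKey = new Map<string, Contact>();

  for (const results of strategyResults) {
    for (const contact of results) {
      // Every name is validated before it is accepted.
      if (!isValidName(contact.name)) continue;

      const key = dedupeKey(contact);
      const existing = byKey.get(key);
      if (existing === undefined) {
        byKey.set(key, { ...contact });
        continue;
      }

      // Fill gaps only; earlier strategies keep the fields they already set.
      if (!existing.role && contact.role) existing.role = contact.role;
      if (!existing.email && contact.email) existing.email = contact.email;
      if (!existing.phone && contact.phone) existing.phone = contact.phone;
    }
  }

  return Array.from(byKey.values());
}
```

Because the strategies run in parallel per page, their outputs could be collected with Promise.all and handed to a function like this as one array of lists.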
Known wins from autoresearch
- JSON-LD added in round 1: +18.7 composite
- See Experiment Results for the full table
Skip when
Set USE_FIRECRAWL=true to use the LLM-based Firecrawl extractor instead. The Phase 2 A/B comparison has not been run yet, so don’t switch defaults blindly.
See also
Name Validation, JSON-LD Extraction, EnrichV7, Firecrawl, Autoresearch Loop.