Crawlee Scraper

Primary scraper. File: src/enrichment/sources/crawlee.ts (~500 lines). Activated with USE_CRAWLEE=true.

Crawlee wraps Playwright and handles the crawl loop. A real browser is required because many Swedish SME sites are rendered client-side (React apps, Wix sites); plain cheerio parses only the static HTML and misses that content.

Crawl behaviour

Up to 12 pages per company
Prioritises: /kontakt, /om-oss, /team, /personal, /medarbetare, /about
Skips: /blogg, /nyheter, /karriar, /produkt, /priser
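The prioritise/skip rules above could be applied roughly as follows. This is a minimal sketch, not the code in crawlee.ts: the names PRIORITY_PATHS, SKIP_PATHS, classifyPath and planCrawl are hypothetical, and it assumes prefix matching on the URL path and that prioritised pages are simply visited first within the 12-page budget.

```typescript
// Path lists copied from the bullets above; the constant names are hypothetical.
const PRIORITY_PATHS = ["/kontakt", "/om-oss", "/team", "/personal", "/medarbetare", "/about"];
const SKIP_PATHS = ["/blogg", "/nyheter", "/karriar", "/produkt", "/priser"];
const MAX_PAGES = 12;

type Verdict = "priority" | "normal" | "skip";

// Classify a URL path. Assumption: a path matches a rule when it starts with that prefix.
function classifyPath(path: string): Verdict {
  const p = path.toLowerCase();
  if (SKIP_PATHS.some((prefix) => p.startsWith(prefix))) return "skip";
  if (PRIORITY_PATHS.some((prefix) => p.startsWith(prefix))) return "priority";
  return "normal";
}

// Drop skipped paths, put priority paths first (discovery order is kept
// within each group), then cap the plan at MAX_PAGES.
function planCrawl(paths: string[]): string[] {
  const priority = paths.filter((p) => classifyPath(p) === "priority");
  const normal = paths.filter((p) => classifyPath(p) === "normal");
  return [...priority, ...normal].slice(0, MAX_PAGES);
}
```

Prefix matching is deliberately loose here: /produkter is skipped because it starts with /produkt. Whether the real scraper matches that way is not stated on this page.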

Six extraction strategies (run in parallel per page)

DOM extraction — page.evaluate() finds containers with CSS classes person|member|employee|staff|team|contact|people|colleague, extracts name/role/email/phone from sub-elements.
Text extraction — splits page plain text into lines, accepts lines passing isValidPersonName(), pulls role from next line, scans ±150-char window for email and phone.
Email local-part extraction — derives names from emails. kristofer@snoggy.se → “Kristofer”. anna.svensson@company.se → “Anna Svensson”. Filters generic prefixes (info, support, hej, kontakt).
Swedish prose patterns — regex like /Kontakta\s+([A-ZÅÄÖ]...)/, /VD:\s*(...)/, /Ansvarig:\s*(...)/.
Image alt text — scans <img alt>. Single-word alts must match ^[\p{Lu}][\p{Ll}]{2,}$ (rejects IMG_3245, Photo1).
JSON-LD — see JSON-LD Extraction. Single biggest gain in autoresearch (+18.7 composite points vs baseline).
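Two of the strategies above are small enough to illustrate directly: email local-part extraction and the single-word alt-text check. The sketch below is an assumption about how they might look, not the code in crawlee.ts. The function names nameFromEmail and isPlausibleSingleWordAlt are hypothetical, the generic-prefix set contains only the four examples listed above, and splitting the local part on dots and underscores is a guess.

```typescript
// Generic mailbox prefixes named above; the real list may be longer.
const GENERIC_PREFIXES = new Set(["info", "support", "hej", "kontakt"]);

function capitalise(part: string): string {
  return part.charAt(0).toUpperCase() + part.slice(1).toLowerCase();
}

// Derive a display name from the local part of an email address.
// Returns null for generic mailboxes or local parts that contain non-letters.
function nameFromEmail(email: string): string | null {
  const at = email.indexOf("@");
  if (at <= 0) return null;
  const local = email.slice(0, at).toLowerCase();
  if (GENERIC_PREFIXES.has(local)) return null;
  // Assumption: name parts are separated by "." or "_".
  const parts = local.split(/[._]/).filter((p) => p.length > 0);
  if (parts.length === 0 || parts.some((p) => !/^\p{L}+$/u.test(p))) return null;
  return parts.map(capitalise).join(" ");
}

// The single-word alt-text pattern quoted above: one uppercase letter
// followed by at least two lowercase letters, Unicode-aware.
const SINGLE_WORD_ALT = /^[\p{Lu}][\p{Ll}]{2,}$/u;

function isPlausibleSingleWordAlt(alt: string): boolean {
  return SINGLE_WORD_ALT.test(alt.trim());
}
```

With these definitions, nameFromEmail("anna.svensson@company.se") yields "Anna Svensson" and nameFromEmail("info@company.se") yields null, while isPlausibleSingleWordAlt rejects both IMG_3245 and Photo1. Names produced this way would still need to pass Name Validation.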

The results of all six strategies are merged and deduplicated into one contact list. Every name must pass Name Validation before it is accepted.
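The run-in-parallel, validate, merge and deduplicate flow can be sketched as below. Everything here is illustrative: the Contact shape, the Strategy type, and the functions mergeContacts and runStrategies are hypothetical, and deduplicating on a lower-cased, trimmed name is an assumption, since this page does not say which key the real code uses. The validator is passed in as a parameter so that the sketch does not guess at what isValidPersonName() does.

```typescript
interface Contact {
  name: string;
  role?: string;
  email?: string;
  phone?: string;
}

// Hypothetical strategy signature: takes page content, resolves to contacts.
type Strategy = (pageContent: string) => Promise<Contact[]>;

// Merge contact lists, keyed on a normalised name (assumed dedup key).
// The first occurrence wins; later duplicates only fill in missing fields.
function mergeContacts(lists: Contact[][]): Contact[] {
  const byKey = new Map<string, Contact>();
  for (const list of lists) {
    for (const contact of list) {
      const key = contact.name.trim().toLowerCase();
      const existing = byKey.get(key);
      if (!existing) {
        byKey.set(key, { ...contact });
        continue;
      }
      existing.role = existing.role ?? contact.role;
      existing.email = existing.email ?? contact.email;
      existing.phone = existing.phone ?? contact.phone;
    }
  }
  return Array.from(byKey.values());
}

// Run every strategy concurrently on one page, drop names that fail
// validation, then merge what is left into a single list.
async function runStrategies(
  strategies: Strategy[],
  pageContent: string,
  isValidName: (name: string) => boolean,
): Promise<Contact[]> {
  const results = await Promise.all(strategies.map((s) => s(pageContent)));
  const validated = results.map((list) => list.filter((c) => isValidName(c.name)));
  return mergeContacts(validated);
}
```

Filling gaps rather than overwriting means a contact found by DOM extraction with only a name can still pick up an email that another strategy found for the same person.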

Known wins from autoresearch

JSON-LD added in round 1: +18.7 composite
See Experiment Results for the full table

Skip when

Set USE_FIRECRAWL=true to use the LLM-based Firecrawl extractor instead. Phase 2 A/B comparison not yet run — don’t switch defaults blindly.
