Scope

Two-day window, 2026-03-26 → 2026-03-27. Three commits introduce Firecrawl as a feature-flagged alternative to Playwright extraction, then patch it.

The experiment ran in parallel with Playwright but was eventually superseded by Crawlee Scraper. Firecrawl remains in the codebase, gated behind USE_FIRECRAWL=true. See Firecrawl.

Commits

8952c8d — 2026-03-26 — feat: Firecrawl LLM extractor — Phase 1 (feature-flagged)

Co-author: Claude Sonnet 4.6.

  • New flag USE_FIRECRAWL=true routes website extraction through Firecrawl’s structured JSON extraction instead of Playwright. Playwright remains the default; no existing behaviour changes.
  • src/enrichment/sources/firecrawl.ts — extractor using @mendable/firecrawl-js. Lazy client singleton (test-isolatable via _setClientForTest). Zod schema for contacts, address, services, social_links. Scrapes all contact pages in parallel and merges. Applies isValidPersonName() + inferRoleType() on LLM output. Throws on missing FIRECRAWL_API_KEY.
  • src/enrichment/sources/website.ts — dispatches to Firecrawl or Playwright.
  • src/enrichment/types.ts — 'firecrawl' added to website_method union.
  • tests/enrichment/firecrawl.test.ts — 25 tests, no API key required.
  • docs/FIRECRAWL_MIGRATION.md — architecture, credit model, phase roadmap.
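
The flag dispatch and the lazy, test-isolatable client described above can be sketched roughly as follows. This is a minimal illustration, not the code from 8952c8d: the FirecrawlLikeClient interface, the Deps object, the env parameter and the shape of ExtractionResult are all assumptions made so the example is self-contained. Only the names USE_FIRECRAWL, FIRECRAWL_API_KEY, _setClientForTest and website_method come from the commit notes.

```typescript
// Assumed shapes; the real types live in src/enrichment/types.ts.
type WebsiteMethod = "playwright" | "firecrawl";

interface ExtractionResult {
  website_method: WebsiteMethod;
  contacts: string[];
}

// Hypothetical stand-in for the SDK client, NOT the real @mendable/firecrawl-js type.
interface FirecrawlLikeClient {
  scrape(url: string): Promise<{ contacts: string[] }>;
}

// Hypothetical dependency bundle so the sketch needs no real SDK or browser.
interface Deps {
  createFirecrawl: (apiKey: string) => FirecrawlLikeClient;
  playwright: (url: string) => Promise<string[]>;
}

type Env = Record<string, string | undefined>;

// Lazy singleton: nothing is constructed until the first Firecrawl call.
let client: FirecrawlLikeClient | null = null;

// Test seam: inject a fake client, or pass null to reset between tests.
function _setClientForTest(fake: FirecrawlLikeClient | null): void {
  client = fake;
}

function getClient(env: Env, create: Deps["createFirecrawl"]): FirecrawlLikeClient {
  if (client) return client; // an injected fake bypasses the key check
  const apiKey = env["FIRECRAWL_API_KEY"];
  if (!apiKey) throw new Error("FIRECRAWL_API_KEY is not set");
  client = create(apiKey);
  return client;
}

// Only the literal string "true" enables Firecrawl; Playwright stays the default.
function useFirecrawl(env: Env): boolean {
  return env["USE_FIRECRAWL"] === "true";
}

async function extractWebsite(url: string, env: Env, deps: Deps): Promise<ExtractionResult> {
  if (useFirecrawl(env)) {
    const fc = getClient(env, deps.createFirecrawl);
    const { contacts } = await fc.scrape(url);
    return { website_method: "firecrawl", contacts };
  }
  return { website_method: "playwright", contacts: await deps.playwright(url) };
}
```

Because the injected client is checked before the API key, a test can exercise the Firecrawl path with no key at all, which is consistent with the "no API key required" note on the test file.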

8f2d1a5 — 2026-03-26 — fix: Firecrawl code review fixes

The commit body is sparse. It is a code-review patch applied the same day Firecrawl was introduced.

549dd51 — 2026-03-27 — feat: improve Firecrawl contact extraction with URL guessing, HTML fallback, and better prompt

Co-author: Claude Sonnet 4.6.

  • Fixed SDK v4 format: json must be inline in formats array (not jsonOptions).
  • Added guessContactUrls() — probes 8 common Swedish contact paths via HEAD when Firecrawl map() returns nothing. Prevents falling back to homepage only.
  • Added extractContactsFromHtml() — cheerio fallback for zero-contact LLM results. Parses h2/h3/h4/strong for person names with role and mailto:.
  • Improved LLM prompt: explicitly asks for team members, executives, board members. Swedish section names (Om oss, Medarbetare, Ledning).
  • Added HTML format to Firecrawl scrape response for fallback without extra requests.
  • Extracted ENRICH_USER_AGENT to shared constant in config.ts (used by domain.ts too).
  • Fixed double inferRoleType() call.
  • 13 new tests: guessContactUrls (4), extractContactsFromHtml (7), html fallback (2).
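
The URL-guessing fallback above can be sketched as follows, under stated assumptions. The commit notes do not enumerate the 8 probed paths, so CANDIDATE_PATHS is purely illustrative. The HeadFn parameter and pickPagesToScrape are hypothetical helpers introduced so that the sketch runs without network access; the real guessContactUrls() signature may differ.

```typescript
// Illustrative only: the actual 8 paths probed in 549dd51 are not listed in the commit notes.
const CANDIDATE_PATHS: string[] = [
  "/kontakt", "/kontakta-oss", "/om-oss", "/medarbetare",
  "/personal", "/ledning", "/team", "/contact",
];

// Hypothetical injection point. In production this might be something like
// (u) => fetch(u, { method: "HEAD" }), whose Response exposes an `ok` flag.
type HeadFn = (url: string) => Promise<{ ok: boolean }>;

// Probe every candidate in parallel and keep the ones that answer with a success status.
// A network error on one path counts as "not found" and does not abort the others.
async function guessContactUrls(
  baseUrl: string,
  head: HeadFn,
  paths: string[] = CANDIDATE_PATHS,
): Promise<string[]> {
  const root = baseUrl.replace(/\/+$/, ""); // strip trailing slashes before joining
  const candidates = paths.map((p) => root + p);
  const checks = await Promise.all(
    candidates.map((u) => head(u).then((r) => r.ok, () => false)),
  );
  return candidates.filter((_, i) => checks[i] === true);
}

// Fallback order: mapped URLs first, then guessed URLs, then the homepage as a last resort.
async function pickPagesToScrape(
  baseUrl: string,
  mappedUrls: string[],
  head: HeadFn,
): Promise<string[]> {
  if (mappedUrls.length > 0) return mappedUrls;
  const guessed = await guessContactUrls(baseUrl, head);
  return guessed.length > 0 ? guessed : [baseUrl];
}
```

Injecting the HEAD function keeps the probing logic testable offline, in the same spirit as the _setClientForTest seam used for the Firecrawl client.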

Significance

  • Firecrawl was the first attempt to use an LLM-based extractor for contacts.
  • Phase 2 A/B comparison vs Playwright is referenced in docs/FIRECRAWL_MIGRATION.md but never appears in commit history. By 2026-04-01 the project pivots to Crawlee. See History Crawlee Era.
  • The HTML and URL-guessing fallbacks added in 549dd51 hedge against the LLM returning empty — implicit acknowledgement that the LLM extraction alone was unreliable.

As of 2026-04-02, USE_CRAWLEE selects the recommended scraper. Firecrawl is dormant unless explicitly enabled with USE_FIRECRAWL=true.

See also

Firecrawl, Crawlee Scraper, History Crawlee Era, History Overview.
