Tests for the EnrichV7 pipeline, contact extractors, and supporting processors. Most are fast and deterministic: they use inline HTML fixtures and injected mock clients. Two integration tests hit the live network.

Files

| File | Lines | Network? |
| --- | --- | --- |
| tests/enrichment/processors.test.ts | 183 | No |
| tests/enrichment/crawlee-quality.test.ts | 308 | No |
| tests/enrichment/firecrawl.test.ts | 512 | No (mock client) |
| tests/enrichment/board-members-integration.test.ts | 31 | Yes, calls enrichV7() end-to-end |
| src/enrichmentEngine.v7.test.ts | 339 | No |
| autoresearch/regression.test.ts | 283 | No (see Autoresearch Loop) |

processors.test.ts

Unit tests for src/enrichment/processors/:

  • inferRoleType: VD/CEO → VD; CFO/Ekonomichef → CFO; PR-konsult / Copywriter / Kommunikationskonsult → Marknadschef; Rekryterare / Talent Manager → Personalansvarig; unknown → Övrig
  • normalizeName: strips Swedish/accent diacritics (André → andre, Björn → bjorn, Åsa → asa)
  • maybeFlipSwedishName: flips Karlsson Anna → Anna Karlsson; leaves three-word names alone
  • normalizeCompanyName: strips trailing AB, HB, EF
  • extractBrandName: VendFox Solutions AB → VendFox (strips generic word + suffix); returns undefined when no generic word
  • isValidPersonName: see Name Validation — accepts Swedish full names, rejects company suffixes, emails, single words, nav phrases
  • normSe: Swedish-char + non-alphanumeric strip (Åsa → asa)
  • extractEmails: dedupe, lowercase
  • generateEmailGuesses: produces firstname.lastname@domain first, all confidence: 'low' and source: 'generated_guess'
  • detectEmailPattern: pattern detection from observed addresses
  • extractPhones: Swedish mobile (+46/07x) tagged mobile, landline (e.g. 08-) tagged landline
  • calculateLeadScore: 0 for empty, capped at 10, awards points for VD contact
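
The name-normalization cases above can be illustrated with a minimal sketch. The function names `normalizeNameSketch` and `flipSwedishNameSketch`, the small diacritic map, and the "-son" surname heuristic are all assumptions made for illustration; the real implementations in src/enrichment/processors/ may work differently.

```typescript
// Hypothetical sketch; not the actual code in src/enrichment/processors/.

// Assumed, deliberately small map of accented characters to ASCII.
const DIACRITIC_MAP: { [ch: string]: string } = {
  "å": "a", "ä": "a", "á": "a", "ö": "o", "é": "e", "è": "e", "ü": "u",
};

// Lowercase the name, then replace each mapped character with its ASCII form.
function normalizeNameSketch(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[åäáöéèü]/g, (ch) => DIACRITIC_MAP[ch] || ch);
}

// Assumed heuristic: a two-word name whose first word ends in "-son"
// (and whose second word does not) is in "Surname Firstname" order.
function flipSwedishNameSketch(name: string): string {
  const parts = name.trim().split(/\s+/);
  if (parts.length !== 2) return name; // three-word names are left alone
  const first = parts[0];
  const second = parts[1];
  const looksLikeSurname = (w: string): boolean => /son$/i.test(w);
  if (looksLikeSurname(first) && !looksLikeSurname(second)) {
    return second + " " + first;
  }
  return name;
}

console.log(normalizeNameSketch("André"));           // "andre"
console.log(normalizeNameSketch("Björn"));           // "bjorn"
console.log(flipSwedishNameSketch("Karlsson Anna")); // "Anna Karlsson"
```

A test in the style described above would then assert each input/output pair directly, one pair per case.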

crawlee-quality.test.ts

Tests extractContactsFromText and extractContactsFromImgAlts from src/enrichment/sources/crawlee.ts using inline HTML/text fixtures.

Key assertions:

  • Three-contact vendfox-style fixture: name + role + email + phone correctly attached, no email bleed-over to the next contact
  • Phone type: mobile vs landline classified
  • Rejects nav phrases (Om Oss), single words, company-suffix names (Bolaget AB)
  • Image-alt extraction: <img src="/team/anna.jpg" alt="Anna Karlsson VD" /> → contact with name='Anna Karlsson', role='VD'
  • Strips company-name words from extracted role (vendfox removed when company = VendFox Solutions AB)
  • All-caps role tokens (VD, CEO) treated as role, not part of name
  • Path separator handling: team-anna.jpg, team_photo_anna.jpg both match the team-image heuristic
  • Filename false-positive guards: rejects More_PR_Team, Photo1
  • Single-quoted HTML attributes parsed (regression for the bug noted in feedback_crawlee_patterns.md)
  • Dutch/Swedish name particles (van, de, af) accepted
  • inferRoleType PR/communications expansion verified
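
As a rough illustration of the image-alt cases above, the sketch below pulls a name and a trailing all-caps role out of `alt` attributes, accepting both single- and double-quoted values. The function name `extractContactsFromImgAltsSketch`, the `AltContact` type, and the tiny `ROLE_TOKENS` list are hypothetical; the real extractContactsFromImgAlts also applies name validation and filename heuristics that are not reproduced here.

```typescript
// Hypothetical sketch of alt-text contact extraction; the real function in
// src/enrichment/sources/crawlee.ts may differ substantially.

interface AltContact {
  name: string;
  role?: string;
}

// Assumed role vocabulary, for this sketch only.
const ROLE_TOKENS: string[] = ["VD", "CEO", "CFO"];

function extractContactsFromImgAltsSketch(html: string): AltContact[] {
  const contacts: AltContact[] = [];
  // Accept both double-quoted and single-quoted alt attributes.
  const altRe = /<img\b[^>]*?\balt\s*=\s*(?:"([^"]*)"|'([^']*)')[^>]*>/gi;
  let m: RegExpExecArray | null;
  while ((m = altRe.exec(html)) !== null) {
    const alt = (m[1] !== undefined ? m[1] : m[2] || "").trim();
    const words = alt.split(/\s+/).filter((w) => w.length > 0);
    // Trailing all-caps role tokens belong to the role, not the name.
    const roleWords: string[] = [];
    while (
      words.length > 0 &&
      ROLE_TOKENS.indexOf(words[words.length - 1]) !== -1
    ) {
      roleWords.unshift(words.pop() as string);
    }
    // Require at least two name words, which rejects single-word alts.
    if (words.length < 2) continue;
    contacts.push({
      name: words.join(" "),
      role: roleWords.length > 0 ? roleWords.join(" ") : undefined,
    });
  }
  return contacts;
}
```

Feeding it `<img src="/team/anna.jpg" alt="Anna Karlsson VD" />` yields one contact with name `Anna Karlsson` and role `VD`; the single-quoted form of the same tag gives the same result.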

firecrawl.test.ts

Tests src/enrichment/sources/firecrawl.ts without a real Firecrawl API key. Uses the test-only hooks _setClientForTest() and _resetClientForTest().
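
The injection pattern behind these hooks might look roughly like the sketch below. Only the hook names `_setClientForTest()` and `_resetClientForTest()` come from the source; the `ScrapeClient` interface, the `getClient` helper, and the error messages are assumptions, and the client is synchronous here purely to keep the sketch short (a real Firecrawl client is presumably asynchronous).

```typescript
// Hypothetical module-level client holder illustrating the test-hook pattern.

interface ScrapeClient {
  scrape(url: string): { markdown: string };
}

let injectedClient: ScrapeClient | null = null;

// Test-only: route all calls through the given fake client.
function _setClientForTest(client: ScrapeClient): void {
  injectedClient = client;
}

// Test-only: restore normal client resolution.
function _resetClientForTest(): void {
  injectedClient = null;
}

// Assumed behavior: prefer the injected client; otherwise require an API key.
function getClient(apiKey: string | undefined): ScrapeClient {
  if (injectedClient) return injectedClient;
  if (!apiKey) {
    throw new Error("FIRECRAWL_API_KEY is not set");
  }
  throw new Error("real client construction is omitted in this sketch");
}

// Usage inside a test: inject, exercise, then reset.
const fakeClient: ScrapeClient = {
  scrape: (url) => ({ markdown: "# fixture for " + url }),
};
_setClientForTest(fakeClient);
console.log(getClient(undefined).scrape("https://acme.se/kontakt").markdown);
_resetClientForTest();
```

Resetting after each test keeps one test's mock from leaking into the next, which is what lets the missing-key case be asserted in the same file.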

Coverage:

  • discoverContactPages: caps at 3 results; accepts /kontakt, /contact, /om-oss, /team, /medarbetare, /ledning, /styrelse; depth filter accepts depth ≤ 2 (/en/contact ok, /nyheter/post/about rejected)
  • CONTACT_PAGE_PATTERNS regex matches expected paths and rejects /blog, /products, /nyheter
  • https:// guard: https://acme.se not double-prefixed to https://https://acme.se
  • Missing FIRECRAWL_API_KEY throws an informative error (does not silently return failed)
  • With mock client: returns a shape compatible with scrapeWebsite(): {contacts, emails, phones, tech_stack, social_links, services, method, headline, description}
  • isValidPersonName filter applied to LLM output: World Trade Center and Kontakt filtered out
  • Dedupe by lowercase name; emails lowercased and deduped; services capped at 8 and deduped
  • tech_stack always [] (Firecrawl extraction does not produce it)
  • Multi-page merge: contacts combine across pages; headline from first successful page only
  • guessContactUrls: probes common paths via injected fetch, falls back to homepage on all-404, capped at 3
  • extractContactsFromHtml cheerio fallback: finds names from <h3>, attaches nearby mailto: email, sets confidence: 'low', source: 'firecrawl-html-fallback'
  • HTML fallback used when Firecrawl returns 0 contacts but HTML contains names
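
The path filtering and scheme guard covered above can be sketched as follows. The regex `CONTACT_PATTERN_SKETCH` and the helpers `ensureHttps`, `pathDepth`, and `pickContactPaths` are hypothetical stand-ins written for this example; the real CONTACT_PAGE_PATTERNS and discoverContactPages in src/enrichment/sources/firecrawl.ts may be defined differently.

```typescript
// Hypothetical sketch of contact-page selection; not the actual implementation.

// Assumed pattern: the last path segment is one of the known contact-page slugs.
const CONTACT_PATTERN_SKETCH =
  /\/(kontakt|contact|om-oss|team|medarbetare|ledning|styrelse)\/?$/i;

// Add a scheme only when one is missing, so "https://acme.se" is not double-prefixed.
function ensureHttps(domainOrUrl: string): string {
  return /^https?:\/\//i.test(domainOrUrl)
    ? domainOrUrl
    : "https://" + domainOrUrl;
}

// Depth is the number of non-empty path segments.
function pathDepth(path: string): number {
  return path.split("/").filter((s) => s.length > 0).length;
}

// Keep paths at depth <= 2 that match the pattern, capped at `max` results.
function pickContactPaths(paths: string[], max: number = 3): string[] {
  return paths
    .filter((p) => pathDepth(p) <= 2 && CONTACT_PATTERN_SKETCH.test(p))
    .slice(0, max);
}

console.log(
  pickContactPaths([
    "/kontakt",
    "/en/contact",
    "/nyheter/post/contact",
    "/blog",
    "/team",
    "/ledning",
  ])
); // ["/kontakt", "/en/contact", "/team"]
```

In this sketch `/nyheter/post/contact` is dropped by the depth check alone, and `/ledning` is dropped only because the cap of three has already been reached.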

board-members-integration.test.ts

Two integration tests that invoke enrichV7() against org_nr 5565672655 (Gordons Project AB) with bypass_cache: true. Each test has a 30-second timeout.

Warning

These tests hit the live BV API, Serper, and websites, and will fail without network access or with rate-limited keys. They assert only structural shape (contacts is an array, updated_fields is an array, company.org_nr matches), not specific contact content.
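
A shape-only check of this kind could be written along the lines below. The `EnrichResultSketch` type is inferred solely from the three assertions listed above, and `checkShape` is a hypothetical helper; the real enrichV7() return type very likely carries more fields.

```typescript
// Hypothetical result shape, inferred from the assertions described above.
interface EnrichResultSketch {
  company: { org_nr: string };
  contacts: unknown[];
  updated_fields: unknown[];
}

// Returns a list of structural problems; an empty list means the shape is fine.
// Contact content is deliberately not inspected, since live data changes.
function checkShape(result: unknown, expectedOrgNr: string): string[] {
  const r = result as Partial<EnrichResultSketch> | null;
  if (!r || typeof r !== "object") return ["result is not an object"];
  const problems: string[] = [];
  if (!Array.isArray(r.contacts)) {
    problems.push("contacts is not an array");
  }
  if (!Array.isArray(r.updated_fields)) {
    problems.push("updated_fields is not an array");
  }
  if (!r.company || r.company.org_nr !== expectedOrgNr) {
    problems.push("company.org_nr mismatch");
  }
  return problems;
}

// Usage with a stand-in object in place of a live enrichV7() result.
const standIn = {
  company: { org_nr: "5565672655" },
  contacts: [],
  updated_fields: [],
};
console.log(checkShape(standIn, "5565672655").length); // 0
```

Asserting structure rather than content keeps the tests from breaking whenever the upstream sources change what they return.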

src/enrichmentEngine.v7.test.ts

Unit tests for the v7 helpers exported from src/enrichment/index.ts. They overlap significantly with processors.test.ts (the same inferRoleType / isValidPersonName / normalizeCompanyName / normSe / extractPhones / extractEmails cases) and additionally cover the Article 14 notification module (src/lib/article14Notification.ts), tested with a mock pg pool: NO_EMAIL short-circuit, alreadySent detection via SELECT, RoPA insert on attempt, and bulk notifyDataSubjects. See Article 14.

Note

processors.test.ts and src/enrichmentEngine.v7.test.ts duplicate ~40% of their assertions. The duplicated cases could be deleted from either file with no coverage loss. The same applies to the role-mapping cases in autoresearch/regression.test.ts.

See also

Test Strategy, EnrichV7, Crawlee Scraper, Firecrawl, Name Validation, Autoresearch Loop.
