Scope

A single day, 2026-03-25, in which domain discovery was rewritten, hardened, scored, benchmarked, and patched five times. The most concentrated bug-fix burst in the project’s history.

The same problem (find the real corporate website for a Swedish SME) was attacked from five angles: blocklist, content validation, registry lookup, TLD strategy, and partial enrichment when no domain is found.

Commits in chronological order

23f6c2f — 2026-03-25 — fix: domain resolution hardening, eniro ToS removal, kundkort v2

  • isBlockedDomain() blocks INVALID_DOMAINS plus all 290 Swedish municipality .se domains via MUNICIPALITY_DOMAIN_PATTERN. Fixed cases where stockholm.se etc. were winning.
  • Content-validate top-3 Serper candidates before accepting any domain. Prevents hosting and wrong-company domains from winning by default.
  • Removed dead getDomainFromRegistry / pool imports — the domain_registry table did not exist yet, so every enrichment run was crashing on the import.
  • pipeline.ts: domainValidated flag gates the SMTP and email block. No raw email guess assigned on SMTP error.
  • Added 'domain_not_found' to enrichment_status union.
  • INVALID_NAME_STANDALONE_WORDS: blocks “Talks British English”, “Google Play”, “Besök”, “Destination” from passing as person names.
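
The blocklist check described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the project's code: the contents of INVALID_DOMAINS, the three-municipality pattern, and the normalizeHost() helper are hypothetical stand-ins (the real pattern is said to cover all 290 municipalities).

```typescript
// Hypothetical sample data. The real INVALID_DOMAINS list and the full
// 290-municipality MUNICIPALITY_DOMAIN_PATTERN are not shown in this page.
const INVALID_DOMAINS: readonly string[] = ["allahemsidor.se", "bisnode.se"];
const MUNICIPALITY_DOMAIN_PATTERN =
  /^(?:[a-z0-9-]+\.)*(?:stockholm|goteborg|malmo)\.se$/;

// Hypothetical helper: reduce a URL or host to a bare lowercase hostname.
function normalizeHost(input: string): string {
  return input
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, "") // drop scheme
    .replace(/\/.*$/, "")        // drop path
    .replace(/^www\./, "");      // drop leading www.
}

function isBlockedDomain(domain: string): boolean {
  const host = normalizeHost(domain);
  // Municipality sites (and their subdomains) are never a company website.
  if (MUNICIPALITY_DOMAIN_PATTERN.test(host)) return true;
  // Block each listed domain and any subdomain of it.
  return INVALID_DOMAINS.some(
    (blocked) => host === blocked || host.endsWith("." + blocked),
  );
}
```

Under these assumptions, isBlockedDomain("https://www.stockholm.se/foo") is true and isBlockedDomain("tmpab.com") is false.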

50390db — 2026-03-25 — feat: add authoritative registry-based domain lookup

Plain feature commit, sparse body. Adds the registry path that 23f6c2f had to rip out.

ce92174 — 2026-03-25 — feat: IIS .se zone registry (Tier 0) for domain resolution

The substantive registry implementation:

  • download-zone.py — pulls the ~1.47M-row .se zone file from zonedata.iis.se (no longer hardcoded IP). SOA serial check skips re-download when zone unchanged.
  • migrations/006_domain_registry.sql: pg_trgm extension + domain_registry table with GIN trigram index on normalized_name.
  • load-registry.ts — rewritten for Bun.sql, streaming line reader, unnest bulk upsert in batches of 10k.
  • registry.ts — implements getDomainFromRegistry() using pg_trgm similarity (threshold 0.35). Returns null gracefully if table missing.
  • domain.ts — wired Tier 0 between Redis cache and Serper. Free zone lookup before spending Serper budget. Content-validates the registry hit before accepting.

This is the IIS lookup referenced in Domain Discovery.
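
To give a feel for what a 0.35 trigram threshold accepts, here is a rough in-memory approximation of pg_trgm-style similarity. It is only a sketch: the real lookup runs inside PostgreSQL, the tokenisation here is simplified to ASCII letters, digits, and å/ä/ö, and both isRegistryMatch() and the use of >= for the comparison are assumptions.

```typescript
// Approximation of pg_trgm: lowercase, split into words, pad each word with
// two leading spaces and one trailing space, then collect unique 3-grams.
function trigrams(text: string): Set<string> {
  const result = new Set<string>();
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9åäö]+/) // simplified word splitting (assumption)
    .filter((w) => w.length > 0);
  for (const word of words) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      result.add(padded.slice(i, i + 3));
    }
  }
  return result;
}

// Similarity = shared trigrams / size of the union of both trigram sets.
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

const REGISTRY_THRESHOLD = 0.35; // threshold value taken from the commit notes

// Hypothetical helper: would this registry row count as a hit?
function isRegistryMatch(companyName: string, registryName: string): boolean {
  return similarity(companyName, registryName) >= REGISTRY_THRESHOLD;
}
```

For example, "tmp" and "tmpab" share 3 of 7 distinct trigrams (about 0.43), so they would pass the threshold in this sketch. That permissiveness is one reason the registry hit is content-validated before it is accepted.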

6de209b — 2026-03-25 — refactor: domain discovery prioritizes registry lookup

Prioritisation re-order between cache, registry, and Serper. Sparse body — see the diff.

d570bc3 — 2026-03-25 — test: add latency benchmark for domain discovery

Latency benchmark script. No production code change.

cda3728 — 2026-03-25 — docs: add status report, update implementation plans and commit refactored domain discovery pipeline

Documentation + the refactored discovery pipeline commit. Body sparse. Likely the consolidated state after the five iterations above.

d78d33e — 2026-03-25 — fix: address all code-review findings — CRITICAL/HIGH/MEDIUM

A compliance follow-up to the discovery work that also covers code-review findings outside discovery. Critical items:

  • C1: reklamspärr filter applied to export endpoints. Companies with advertising_block=true are excluded from CSV/JSON exports.
  • C2: MAX_EXPORT_LIMIT enforced unconditionally. Unlimited requests now require admin role.
  • C3: Gate 0 (reklamspärr) added inside enrichV7() itself — callers that bypass the dispatcher cannot enrich blocked companies.
  • C4 / H1: when assigning a generated email, set contact.source = 'generated_guess' to preserve provenance.
  • C5: HASH_SALT warning fires in all environments, not only production.

High: H2 wires validateCompanyDomain() into the domain guess loop, so domains that are reachable but unrelated to the company are rejected via content scoring.
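
Items C1 and C2 can be illustrated with a small sketch. Everything here is assumed for illustration: the Company shape beyond advertising_block, the value of MAX_EXPORT_LIMIT, the use of null to mean "unlimited", and the choice to clamp (rather than reject) a non-admin unlimited request.

```typescript
type Role = "admin" | "user";

interface Company {
  orgnr: string;
  advertising_block: boolean; // reklamspärr flag
}

const MAX_EXPORT_LIMIT = 1000; // hypothetical value

// C2 sketch: a null limit means "unlimited" (assumption). Only admins get an
// unbounded export; every other request is clamped to MAX_EXPORT_LIMIT.
function resolveExportLimit(requested: number | null, role: Role): number {
  if (requested === null) {
    return role === "admin" ? Number.POSITIVE_INFINITY : MAX_EXPORT_LIMIT;
  }
  const wanted = Math.max(0, Math.floor(requested));
  return Math.min(wanted, MAX_EXPORT_LIMIT);
}

// C1 sketch: drop reklamspärr companies before applying the limit, so blocked
// rows never reach a CSV/JSON export.
function selectExportRows(companies: Company[], limit: number): Company[] {
  const allowed = companies.filter((c) => !c.advertising_block);
  return isFinite(limit) ? allowed.slice(0, limit) : allowed;
}
```

Filtering before slicing means a blocked company never takes up one of the limited export slots.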

ac8256f — 2026-03-25 — docs: generate updated kundkort and add contact extraction to roadmap

Generated kundkort + roadmap update. Documentation only.

68a59da — 2026-03-25 — chore: remove temporary test companies from benchmark

Housekeeping.

b0122f3 — 2026-03-25 — feat: rebuild kundkort with canonical field spec

Strictly a kundkort change, but committed on the same day. The kundkort now has the canonical 18-field spec (orgnr, SNI, hemsida, sammanfattning, land, anställda 2023/2024/2025, växelnummer, kundtjänst-mejl, omsättning, resultat, koncern, ägda varumärken, kontaktpersoner per roll, upphandlingar, källor). GAP cards render when a field has no value. Role normalisation via ROLE_PATTERNS regex array.

3cfa926 — 2026-03-25 — fix: partial enrichment for .com domains + no domain halt

  • domain.ts — added a 4th Serper query without site:.se so Swedish companies on .com are discovered. Scored .se / .nu = 10, .com = 6 so Swedish TLDs stay preferred.
  • pipeline.ts — removed the domain-required early halt. Maps + LinkedIn + News now run unconditionally so a company without a validated domain still gets phone, address, contacts, growth signals. enrichment_status only set to domain_not_found if all sources return empty.

Concrete win cited in body: “TMP I Uppsala AB now enhanced with phone +46 18-60 72 70 from Google Maps instead of domain_not_found.”
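
The two changes in this commit can be sketched as a TLD scoring function and a status decision. The scores 10 and 6 come from the commit notes; the score of 0 for other TLDs, the "enriched" status name, and the SourceResults shape are assumptions made for the example.

```typescript
// TLD preference: Swedish TLDs outrank .com, per the commit notes.
function tldScore(domain: string): number {
  const host = domain.toLowerCase();
  if (host.endsWith(".se") || host.endsWith(".nu")) return 10;
  if (host.endsWith(".com")) return 6;
  return 0; // assumption: the notes do not state a score for other TLDs
}

// "enriched" is a hypothetical name for the non-failure status.
type EnrichmentStatus = "enriched" | "domain_not_found";

// Hypothetical summary of what each source returned for one company.
interface SourceResults {
  validatedDomain: string | null;
  mapsFound: boolean;
  linkedinFound: boolean;
  newsFound: boolean;
}

// domain_not_found is only reported when every source came back empty.
function resolveStatus(r: SourceResults): EnrichmentStatus {
  const anyData =
    r.validatedDomain !== null || r.mapsFound || r.linkedinFound || r.newsFound;
  return anyData ? "enriched" : "domain_not_found";
}
```

In this sketch a company with no validated domain but a Google Maps hit resolves to "enriched", which mirrors the TMP I Uppsala AB case quoted above.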

5da08d0 — 2026-03-25 — fix: domain resolution — find .com domains, block junk .se directories

The follow-up: .com discovery from 3cfa926 was being blocked by junk .se directory aggregators winning the .se score-10 slot. Fix:

  • domain.ts — expanded content-validation slice 3 → 6 so .com candidates behind junk .se aggregators still reach validation.
  • config.ts — added allahemsidor.se, bygg.se, byggtjanst.se, byggforetagen.se, hantverkskollen.se, syna.se, vainu.com, bisnode.se, creditsafe.se, soliditet.se to INVALID_DOMAINS.
  • Result cited: TMP I Uppsala AB now resolves to tmpab.com (not blocked by aggregator), pulls phone +46 18-60 72 70 and info@tmpab.com. Lead score 0 → 4.1.
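
Why widening the validation slice matters can be shown with a small sketch. The Candidate shape, pickDomain(), and the synchronous validate callback are hypothetical; real content validation would fetch pages and be asynchronous.

```typescript
interface Candidate {
  domain: string;
  score: number; // e.g. the TLD score: .se/.nu = 10, .com = 6
}

// Rank candidates by score, content-validate only the top `sliceSize`,
// and return the first one that passes validation.
function pickDomain(
  candidates: Candidate[],
  sliceSize: number,
  validate: (domain: string) => boolean,
): string | null {
  const ranked = candidates.slice().sort((a, b) => b.score - a.score);
  for (const candidate of ranked.slice(0, sliceSize)) {
    if (validate(candidate.domain)) return candidate.domain;
  }
  return null;
}

// Three aggregator .se hits outrank the real .com site on TLD score.
const candidates: Candidate[] = [
  { domain: "allahemsidor.se", score: 10 },
  { domain: "bygg.se", score: 10 },
  { domain: "hantverkskollen.se", score: 10 },
  { domain: "tmpab.com", score: 6 },
];

// Stand-in validator: only the company's own site passes content validation.
const validate = (domain: string): boolean => domain === "tmpab.com";

const withSliceOfThree = pickDomain(candidates, 3, validate); // null
const withSliceOfSix = pickDomain(candidates, 6, validate);   // "tmpab.com"
```

With a slice of 3 the .com candidate is never validated, so nothing is returned; with a slice of 6 it is reached. Adding the aggregators to INVALID_DOMAINS attacks the same problem from the other side by removing them from the ranking altogether.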

Significance

This day produced the discovery pipeline that Domain Discovery documents today: cache → IIS registry (Tier 0) → Serper with content validation → blocklist filtering → fall through to Maps/LinkedIn/News if nothing validates.

See also

Domain Discovery, Domain Blocklist, Crawlee Scraper, Lead Scoring, History Overview.
