Scope
A single day, 2026-03-25, in which domain discovery was rewritten, hardened, scored, benchmarked, and patched five times. The most concentrated bug-fix burst in the project’s history.
The same problem (find the real corporate website for a Swedish SME) was attacked from five angles: blocklist, content validation, registry lookup, TLD strategy, and partial enrichment when no domain is found.
Commits in chronological order
23f6c2f — 2026-03-25 — fix: domain resolution hardening, eniro ToS removal, kundkort v2
- isBlockedDomain() blocks INVALID_DOMAINS plus all 290 Swedish municipality .se domains via MUNICIPALITY_DOMAIN_PATTERN. Fixed cases where stockholm.se etc. were winning.
- Content-validate top-3 Serper candidates before accepting any domain. Prevents hosting and wrong-company domains from winning by default.
- Removed dead getDomainFromRegistry / pool imports: the domain_registry table did not exist yet, so every enrichment run was crashing on the import.
- pipeline.ts: domainValidated flag gates the SMTP and email block. No raw email guess assigned on SMTP error.
- Added 'domain_not_found' to enrichment_status union.
- INVALID_NAME_STANDALONE_WORDS: blocks “Talks British English”, “Google Play”, “Besök”, “Destination” from passing as person names.
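A minimal sketch of how a check like isBlockedDomain() might combine a static blocklist with a municipality pattern. The blocklist entries, the three municipality names, and the normalizeHost helper are illustrative assumptions; the project's actual INVALID_DOMAINS contents and the full 290-municipality pattern are not reproduced in this note.

```typescript
// Illustrative entries only; the real INVALID_DOMAINS list is not shown in this note.
const INVALID_DOMAINS: ReadonlySet<string> = new Set([
  "example-hosting.se",
  "example-directory.se",
]);

// Assumption: the real pattern enumerates all 290 municipalities; three are shown here.
const MUNICIPALITY_NAMES = ["stockholm", "goteborg", "uppsala"];
const MUNICIPALITY_DOMAIN_PATTERN = new RegExp(
  "(^|\\.)(" + MUNICIPALITY_NAMES.join("|") + ")\\.se$",
);

// Hypothetical helper: reduce a URL or bare hostname to a lowercase host without "www.".
function normalizeHost(input: string): string {
  const withoutScheme = input.trim().toLowerCase().replace(/^[a-z]+:\/\//, "");
  const host = withoutScheme.split(/[/?#]/)[0] ?? "";
  return host.replace(/^www\./, "");
}

function isBlockedDomain(input: string): boolean {
  const host = normalizeHost(input);
  if (host === "") return true; // nothing usable to validate

  // Check the host and each parent domain against the blocklist,
  // stopping before the bare TLD.
  const labels = host.split(".");
  for (let i = 0; i < labels.length - 1; i++) {
    if (INVALID_DOMAINS.has(labels.slice(i).join("."))) return true;
  }
  return MUNICIPALITY_DOMAIN_PATTERN.test(host);
}
```

Anchoring the pattern on a label boundary means a company such as stockholmsbygg.se is not blocked just because its name starts with a municipality name.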
50390db — 2026-03-25 — feat: add authoritative registry-based domain lookup
Plain feature commit, sparse body. Reintroduces the registry lookup whose dead imports 23f6c2f had to remove.
ce92174 — 2026-03-25 — feat: IIS .se zone registry (Tier 0) for domain resolution
The substantive registry implementation:
- download-zone.py: pulls the ~1.47M-row .se zone file from zonedata.iis.se (no longer hardcoded IP). SOA serial check skips re-download when zone unchanged.
- migrations/006_domain_registry.sql: pg_trgm extension + domain_registry table with GIN trigram index on normalized_name.
- load-registry.ts: rewritten for Bun.sql, streaming line reader, unnest bulk upsert in batches of 10k.
- registry.ts: implements getDomainFromRegistry() using pg_trgm similarity (threshold 0.35). Returns null gracefully if table missing.
- domain.ts: wired Tier 0 between Redis cache and Serper. Free zone lookup before spending Serper budget. Content-validates the registry hit before accepting.
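To build intuition for the 0.35 threshold, the following is a pure TypeScript approximation of how pg_trgm scores two names: lowercase, split into words, pad each word with two leading spaces and one trailing space, then divide shared trigrams by distinct trigrams. It is a sketch for illustration only. The project computes similarity inside PostgreSQL, the function names here are hypothetical, and the character class used for word splitting is an assumption (pg_trgm's handling of non-ASCII letters depends on the database locale).

```typescript
// Approximate pg_trgm trigram extraction: each word is padded with two
// leading spaces and one trailing space before slicing into 3-grams.
function trigrams(text: string): Set<string> {
  const result = new Set<string>();
  // Assumption: treat ASCII alphanumerics plus common Swedish letters as word characters.
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9åäöéü]+/)
    .filter((w) => w.length > 0);
  for (const word of words) {
    const padded = "  " + word + " ";
    for (let i = 0; i + 3 <= padded.length; i++) {
      result.add(padded.slice(i, i + 3));
    }
  }
  return result;
}

// Similarity = shared trigrams / distinct trigrams across both strings.
function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Threshold taken from the commit description above.
const SIMILARITY_THRESHOLD = 0.35;

function isRegistryMatch(companyName: string, normalizedName: string): boolean {
  return trigramSimilarity(companyName, normalizedName) >= SIMILARITY_THRESHOLD;
}
```

For example, "word" against "two words" shares 4 of 11 distinct trigrams (about 0.36), which would just clear a 0.35 threshold. That illustrates how permissive the threshold is and why the registry hit is still content-validated before being accepted.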
This is the IIS lookup referenced in Domain Discovery.
6de209b — 2026-03-25 — refactor: domain discovery prioritizes registry lookup
Prioritisation re-order between cache, registry, and Serper. Sparse body — see the diff.
d570bc3 — 2026-03-25 — test: add latency benchmark for domain discovery
Latency benchmark script. No production code change.
cda3728 — 2026-03-25 — docs: add status report, update implementation plans and commit refactored domain discovery pipeline
Documentation + the refactored discovery pipeline commit. Body sparse. Likely the consolidated state after the five iterations above.
d78d33e — 2026-03-25 — fix: address all code-review findings — CRITICAL/HIGH/MEDIUM
Compliance follow-up to the discovery work but also broader. Critical items:
- C1: reklamspärr (advertising opt-out) filter applied to export endpoints. Companies with advertising_block=true are excluded from CSV/JSON exports.
- C2: MAX_EXPORT_LIMIT enforced unconditionally. Unlimited requests now require admin role.
- C3: Gate 0 (reklamspärr) added inside enrichV7() itself, so callers that bypass the dispatcher cannot enrich blocked companies.
- C4 / H1: when assigning a generated email, set contact.source = 'generated_guess' to preserve provenance.
- C5: HASH_SALT warning fires in all environments, not only production.
High: H2 wires validateCompanyDomain() into the domain guess loop — reachable but unrelated domains rejected via content scoring.
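The export rules in C1 and C2 can be sketched as follows. This is one possible reading, not the project's code: the Company shape, the Role type, the function names, and the value 1000 for MAX_EXPORT_LIMIT are all assumptions, and the sketch assumes numeric limits are capped for every role while only an explicit "unlimited" request (modelled as undefined) is reserved for admins.

```typescript
type Role = "admin" | "user";

interface Company {
  orgnr: string;
  advertising_block: boolean;
}

// Hypothetical value; the actual MAX_EXPORT_LIMIT is not stated in this note.
const MAX_EXPORT_LIMIT = 1000;

// Resolve the effective row cap. `undefined` models an "unlimited" request.
function resolveExportLimit(requested: number | undefined, role: Role): number {
  if (requested === undefined) {
    // Only admins may export without a cap; everyone else gets the maximum.
    return role === "admin" ? Number.POSITIVE_INFINITY : MAX_EXPORT_LIMIT;
  }
  // Numeric requests are clamped for every role.
  const sane = Math.max(0, Math.floor(requested));
  return Math.min(sane, MAX_EXPORT_LIMIT);
}

function selectExportRows(
  companies: readonly Company[],
  requested: number | undefined,
  role: Role,
): Company[] {
  // C1: companies with an advertising block never leave the system.
  const allowed = companies.filter((c) => !c.advertising_block);
  // C2: apply the cap after filtering so blocked rows do not consume the quota.
  return allowed.slice(0, resolveExportLimit(requested, role));
}
```

Filtering before slicing is a deliberate choice in this sketch: a capped export then returns up to the limit of exportable companies rather than a shorter list with blocked rows removed afterwards.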
ac8256f — 2026-03-25 — docs: generate updated kundkort and add contact extraction to roadmap
Generated kundkort + roadmap update. Documentation only.
68a59da — 2026-03-25 — chore: remove temporary test companies from benchmark
Housekeeping.
b0122f3 — 2026-03-25 — feat: rebuild kundkort with canonical field spec
Strictly a kundkort (customer card) change, but committed on the same day. The kundkort now has the canonical 18-field spec (orgnr, SNI, hemsida, sammanfattning, land, anställda 2023/2024/2025, växelnummer, kundtjänst-mejl, omsättning, resultat, koncern, ägda varumärken, kontaktpersoner per roll, upphandlingar, källor). GAP cards render when a field has no value. Role normalisation via ROLE_PATTERNS regex array.
3cfa926 — 2026-03-25 — fix: partial enrichment for .com domains + no domain halt
- domain.ts: added a 4th Serper query without site:.se so Swedish companies on .com are discovered. Scored .se/.nu = 10, .com = 6 so Swedish TLDs stay preferred.
- pipeline.ts: removed the domain-required early halt. Maps + LinkedIn + News now run unconditionally so a company without a validated domain still gets phone, address, contacts, growth signals. enrichment_status only set to domain_not_found if all sources return empty.
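The status rule described for pipeline.ts can be sketched as a small pure function. Only 'domain_not_found' is a status value confirmed by this note; the other union members, the SourceResults shape, and the function names are hypothetical and exist only to make the example self-contained.

```typescript
// Only "domain_not_found" is confirmed by this note; the other members are assumptions.
type EnrichmentStatus = "enriched" | "partial" | "domain_not_found";

// Hypothetical shape of what each source returned for one company.
interface SourceResults {
  domain: string | null; // validated domain, if any
  maps: { phone?: string; address?: string } | null;
  linkedin: { contacts: string[] } | null;
  news: { signals: string[] } | null;
}

// True when at least one non-domain source produced usable data.
function hasFallbackData(r: SourceResults): boolean {
  const mapsHit =
    r.maps !== null && (r.maps.phone !== undefined || r.maps.address !== undefined);
  const linkedinHit = r.linkedin !== null && r.linkedin.contacts.length > 0;
  const newsHit = r.news !== null && r.news.signals.length > 0;
  return mapsHit || linkedinHit || newsHit;
}

function resolveStatus(r: SourceResults): EnrichmentStatus {
  if (r.domain !== null) return "enriched";
  // No validated domain: the company is still kept if any other source returned data.
  return hasFallbackData(r) ? "partial" : "domain_not_found";
}
```

Under this sketch, a company with no validated domain but a phone number from Maps is reported as partially enriched rather than domain_not_found.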
Concrete win cited in body: “TMP I Uppsala AB now enhanced with phone +46 18-60 72 70 from Google Maps instead of domain_not_found.”
5da08d0 — 2026-03-25 — fix: domain resolution — find .com domains, block junk .se directories
The follow-up: .com discovery from 3cfa926 was being blocked by junk .se directory aggregators winning the .se score-10 slot. Fix:
- domain.ts: expanded content-validation slice 3 → 6 so .com candidates behind junk .se aggregators still reach validation.
- config.ts: added allahemsidor.se, bygg.se, byggtjanst.se, byggforetagen.se, hantverkskollen.se, syna.se, vainu.com, bisnode.se, creditsafe.se, soliditet.se to INVALID_DOMAINS.
- Result cited: TMP I Uppsala AB now resolves to tmpab.com (not blocked by aggregator), pulls phone +46 18-60 72 70 and info@tmpab.com. Lead score 0 → 4.1.
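The interaction between TLD scoring, the validation slice, and the blocklist can be illustrated with a short sketch. The scores (.se/.nu = 10, .com = 6) come from 3cfa926; the score of 0 for other TLDs, the function names, and the candidate ordering in the usage example are assumptions made for illustration, not the project's implementation or its actual Serper output.

```typescript
// Scores from 3cfa926. The fallback of 0 for other TLDs is an assumption.
function tldScore(domain: string): number {
  if (domain.endsWith(".se") || domain.endsWith(".nu")) return 10;
  if (domain.endsWith(".com")) return 6;
  return 0;
}

// Drop blocklisted domains, rank by TLD score (keeping search order as the
// tie-breaker), and return only the first `sliceSize` candidates for validation.
function candidatesToValidate(
  domains: readonly string[],
  blocked: ReadonlySet<string>,
  sliceSize: number,
): string[] {
  return domains
    .filter((d) => !blocked.has(d))
    .map((d, index) => ({ d, index, score: tldScore(d) }))
    .sort((a, b) => b.score - a.score || a.index - b.index)
    .slice(0, sliceSize)
    .map((c) => c.d);
}

// Hypothetical search results: four aggregators ahead of the real company site.
const exampleResults = [
  "allahemsidor.se",
  "bygg.se",
  "hantverkskollen.se",
  "syna.se",
  "tmpab.com",
];
const noBlocklist: ReadonlySet<string> = new Set<string>();
const aggregators: ReadonlySet<string> = new Set([
  "allahemsidor.se",
  "bygg.se",
  "hantverkskollen.se",
  "syna.se",
]);

const beforeFix = candidatesToValidate(exampleResults, noBlocklist, 3);
const widerSlice = candidatesToValidate(exampleResults, noBlocklist, 6);
const withBlocklist = candidatesToValidate(exampleResults, aggregators, 6);
```

With a slice of 3 and no blocklist, the .com candidate never reaches validation because score-10 aggregators fill every slot. Widening the slice lets it through, and blocklisting the aggregators moves it to the front. The two changes in 5da08d0 are complementary in this sense.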
Significance
This day produced the discovery pipeline that Domain Discovery documents today: cache → IIS registry (Tier 0) → Serper with content validation → blocklist filtering → fall through to Maps/LinkedIn/News if nothing validates.
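The tier order above can be sketched as a single resolver. Everything here is a simplified assumption: the Tiers interface, the function names, and the synchronous signatures are hypothetical (the real lookups hit Redis, PostgreSQL, and Serper and would be asynchronous), and the sketch assumes cached domains are trusted without re-validation.

```typescript
type Tier = "cache" | "registry" | "serper";

interface Resolution {
  domain: string | null;
  tier: Tier | null;
}

// Hypothetical dependencies, injected so the flow can be read in isolation.
interface Tiers {
  cache: (companyName: string) => string | null;
  registry: (companyName: string) => string | null; // Tier 0: IIS .se zone lookup
  serper: (companyName: string) => string[]; // ranked search candidates
  isBlocked: (domain: string) => boolean;
  validate: (companyName: string, domain: string) => boolean; // content validation
}

function resolveDomain(companyName: string, tiers: Tiers): Resolution {
  // 1. Cache: assumed to hold only previously validated domains.
  const cached = tiers.cache(companyName);
  if (cached !== null) return { domain: cached, tier: "cache" };

  // 2. Registry: free lookup, but the hit must still pass content validation.
  const fromRegistry = tiers.registry(companyName);
  if (fromRegistry !== null && tiers.validate(companyName, fromRegistry)) {
    return { domain: fromRegistry, tier: "registry" };
  }

  // 3. Serper: paid search; skip blocklisted domains, accept the first validated one.
  for (const candidate of tiers.serper(companyName)) {
    if (tiers.isBlocked(candidate)) continue;
    if (tiers.validate(companyName, candidate)) {
      return { domain: candidate, tier: "serper" };
    }
  }

  // 4. Nothing validated: the caller falls through to Maps/LinkedIn/News.
  return { domain: null, tier: null };
}
```

Returning the tier alongside the domain is a choice made for this sketch; it makes it easy to measure how often the free registry lookup saves a paid search.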
See also
Domain Discovery, Domain Blocklist, Crawlee Scraper, Lead Scoring, History Overview.