Failed Approaches

Learning from what didn't work

Every failed approach is documented here with context on why it failed and what replaced it.

1. Serper.dev for Domain Discovery

When: Early March 2026

What: Used Serper.dev API for Google search (domain discovery, LinkedIn profiles, news signals)

Why it failed:

  • ToS uncertainty — unclear if commercial B2B enrichment is allowed
  • Credit exhaustion — costs scaled unpredictably
  • Dependency on third-party Google scraping

Replaced by: Free tier approach

  • Google Places API (direct, free tier)
  • IIS .se zone registry (1.4M domains, PostgreSQL + pg_trgm)
  • DNS/HTTP scoring (custom implementation)

Lesson: Free alternatives exist for core functionality. Only pay for premium APIs when free tier quality is proven insufficient.
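
The registry lookup above depends on pg_trgm for fuzzy matching against the .se zone. As a rough illustration of what that matching measures, here is a minimal TypeScript sketch that approximates trigram similarity for a single word (lowercase, pad, collect 3-character windows, then divide shared trigrams by the union). The function names are hypothetical, the production matching runs inside PostgreSQL rather than in application code, and pg_trgm's handling of punctuation and multi-word strings is not reproduced here.

```typescript
// Approximation of trigram similarity for a single word.
// Assumption: two leading spaces and one trailing space of padding.
function trigrams(word: string): Set<string> {
  const padded = `  ${word.toLowerCase()} `;
  const result = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) {
    result.add(padded.slice(i, i + 3));
  }
  return result;
}

// Shared trigrams divided by the size of the union, in the range 0..1.
function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}
```

Identical strings score 1 and strings with no common trigram score 0; a company name compared with a longer domain label that contains it lands somewhere in between.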

2. Playwright-Only Scraping

When: March 8-9, 2026 (v4 → v7 transition)

What: Single-browser Playwright extraction for all websites

Why it failed:

  • Too slow for multi-page sites (15s timeout per page)
  • Missed JS-rendered content on React/Wix sites
  • Ran inline in workers, which became a scaling bottleneck
  • Vision fallback (Claude) was expensive and unreliable

Replaced by: Multi-extractor strategy

  • Crawlee (primary) — multi-page, 6 strategies
  • Playwright (fallback) — for JS-heavy sites
  • Firecrawl (last resort) — LLM-structured extraction

Lesson: No single extractor handles all site types. Layered approach with fallbacks is necessary.
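
The layered approach can be sketched as an ordered chain that tries each extractor in turn and moves on when one throws or returns too little text. This is a minimal sketch under stated assumptions: the `Extractor` interface, the `extractWithFallback` name, and the `minLength` cutoff of 200 characters are all hypothetical, and the real Crawlee, Playwright, and Firecrawl wrappers are not shown.

```typescript
interface ExtractionResult {
  extractor: string; // name of the extractor that produced the text
  text: string;
}

// Hypothetical extractor shape, not the project's actual interface.
interface Extractor {
  name: string;
  extract(url: string): Promise<string | null>;
}

// Try extractors in priority order; return the first usable result.
async function extractWithFallback(
  url: string,
  extractors: Extractor[],
  minLength = 200, // assumed threshold for "usable" content
): Promise<ExtractionResult | null> {
  for (const extractor of extractors) {
    try {
      const text = await extractor.extract(url);
      if (text !== null && text.trim().length >= minLength) {
        return { extractor: extractor.name, text };
      }
    } catch {
      // A failing extractor must not abort the chain; fall through to the next.
    }
  }
  return null; // every extractor failed or returned too little text
}
```

Passing the extractors as an array keeps the priority order (primary, fallback, last resort) in one place and makes each layer easy to replace with a fake in tests.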

3. Hunter.io for Email Discovery

When: Early development

What: Used Hunter.io API to find email patterns by domain

Why it failed:

  • Expensive at scale
  • Limited coverage for Swedish SMEs
  • Rate limiting aggressive

Replaced by: SMTP RCPT TO probing

  • Generate 4 email patterns from name + domain
  • SMTP validate each pattern
  • Free, direct, no third-party dependency

Lesson: Protocol-level validation beats API guessing for email verification.
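
The pattern-generation step can be illustrated as follows. The write-up says four patterns are generated but does not list them, so the four below are assumed common conventions, and `toAscii` and `generateEmailPatterns` are hypothetical names. The SMTP RCPT TO probe itself is not shown.

```typescript
// Transliterate common Swedish letters and strip anything outside a-z.
function toAscii(part: string): string {
  return part
    .toLowerCase()
    .replace(/[åä]/g, "a")
    .replace(/ö/g, "o")
    .replace(/é/g, "e")
    .replace(/[^a-z]/g, "");
}

// Assumed patterns: first.last, firstlast, f.last, first.
function generateEmailPatterns(first: string, last: string, domain: string): string[] {
  const f = toAscii(first);
  const l = toAscii(last);
  if (f === "" || l === "" || domain === "") return [];
  return [
    `${f}.${l}@${domain}`,
    `${f}${l}@${domain}`,
    `${f.charAt(0)}.${l}@${domain}`,
    `${f}@${domain}`,
  ];
}
```

Each candidate would then be checked against the domain's mail server, and only addresses the server accepts would be kept.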

4. Bolagsverket Öppet API for Board Members

When: March 2026

What: Attempted to use Bolagsverket’s open API for board member data

Why it failed:

  • IP-blocked by Bolagsverket
  • Unreliable availability
  • Rate limiting unknown

Replaced by: Bolagsverket VärdefullaDatamängder API (authenticated)

  • OAuth2 token-based
  • Reliable, official API
  • Returns name, SNI, address, trade names, business description

Lesson: Official authenticated APIs beat “open” APIs for production reliability.

5. SCB PxWebApi for Real Company Data

When: March 5, 2026

What: Used SCB PxWebApi v2 to fetch company statistics

Why it failed:

  • Returns statistical aggregates, not real companies
  • orgNr synthesized from dimension codes
  • Name/address are approximations

Current status: Still used for the scb_foundations table, but with an explicit warning in code:

“orgNr/name/address are synthesised from dimension codes, not real company registry records”

Lesson: Statistical APIs ≠ registry APIs. Always verify data source accuracy.
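
One way to make the distinction hard to ignore is to carry provenance in the type rather than only in a comment. The sketch below is an assumption about how that could look, not the project's schema: `Provenance`, `CompanyRecord`, and `registryOnly` are hypothetical names.

```typescript
// Where a record's identifying fields came from.
type Provenance = "registry" | "synthesised";

interface CompanyRecord {
  orgNr: string;
  name: string;
  provenance: Provenance;
}

// Keep only records backed by a real registry; rows synthesised from
// statistical dimension codes are excluded from enrichment input.
function registryOnly(records: CompanyRecord[]): CompanyRecord[] {
  return records.filter((r) => r.provenance === "registry");
}
```

With the field present on every record, downstream code has to decide explicitly whether synthesised rows are acceptable for a given use.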

6. Eniro.se for Contact Discovery

When: March 2026

What: Scraped Eniro.se (Swedish Yellow Pages) for contact data

Why it failed:

  • ToS prohibits automated extraction
  • Site structure changed frequently
  • Data quality inconsistent

Replaced by: Direct website scraping + Google Places

  • More reliable
  • No ToS risk
  • Better data quality

Lesson: Don’t build on sources whose ToS prohibits automated extraction or is unclear about it. Domain blocklist now includes Eniro.

7. allabolag.se / ratsit.se / proff.se

When: Never implemented (correctly blocked early)

What: Considered scraping Swedish business directories

Why blocked:

  • Explicit ToS prohibition on commercial extraction
  • Legal risk high
  • Would undermine the GDPR “legitimate interest” basis

Current status: All 3 domains in INVALID_DOMAINS blocklist (140+ domains total)

Lesson: Run a legal review before adding any data source. ToS compliance matters for the GDPR Art. 6(1)(f) legitimate-interest basis.
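
A blocklist check of this kind might look like the sketch below. Only the four domains named in this document are listed, so the set is an illustrative subset of the real INVALID_DOMAINS list, and `isBlockedDomain` is a hypothetical helper name. Treating unparseable input as blocked is a design assumption (fail closed), not something the source states.

```typescript
// Illustrative subset; the real list is described as 140+ domains.
const INVALID_DOMAINS: ReadonlySet<string> = new Set([
  "eniro.se",
  "allabolag.se",
  "ratsit.se",
  "proff.se",
]);

function isBlockedDomain(input: string): boolean {
  let host: string;
  try {
    // Accept both full URLs and bare hostnames.
    const withScheme = input.indexOf("://") !== -1 ? input : `https://${input}`;
    host = new URL(withScheme).hostname.toLowerCase();
  } catch {
    return true; // Unparseable input is rejected rather than scraped.
  }
  // Walk up the label chain so subdomains of a blocked domain match too.
  const labels = host.split(".");
  for (let i = 0; i < labels.length - 1; i++) {
    if (INVALID_DOMAINS.has(labels.slice(i).join("."))) return true;
  }
  return false;
}
```

Matching on whole labels rather than substrings means www.allabolag.se is blocked while an unrelated domain that merely contains the same characters is not.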

8. LinkedIn Search Integration

When: March 2026

What: Planned to use LinkedIn for public profile discovery

Why it failed:

  • LinkedIn aggressively blocks scraping
  • ToS explicitly prohibits automated access
  • No reliable API for public profiles

Current status: linkedInSearch() returns empty array. Code exists but disabled.

Lesson: Some data sources are simply inaccessible. Don’t waste engineering time on them.

9. Single “God File” Enrichment Engine

When: March 24, 2026

What: enrichmentEngine.v7.ts, a 1000+ line monolith

Why it failed:

  • Unmaintainable
  • No unit test coverage
  • Changes risky (affects everything)

Replaced by: Modular decomposition (Phase 2)

  • src/enrichment/pipeline.ts — orchestration
  • src/enrichment/sources/ — data sources
  • src/enrichment/processors/ — post-processing
  • src/extractors/ — scraping extractors

Lesson: Decompose early. God files become unmaintainable faster than you think.

10. Mock Validation in Production

When: Ongoing (P0 bug)

What: src/mocks/validation.ts returns Math.random() > 0.5 for all 4 validation layers

Why it’s wrong:

  • Imported by production code
  • is_validated field is meaningless
  • No actual validation happening

Status: Still present. Real validation logic exists (Bolagsverket API, SMTP probes) but not wired up.

Lesson: Mocks should NEVER be imported in production paths. Separate test mocks from production stubs.
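
To illustrate the difference between a test mock and a production stub, the sketch below uses a three-state result so that an unwired layer reports "unverified" instead of a random boolean. All names here (`LayerStatus`, `Validator`, `notWiredUp`, `isValidated`) are hypothetical, and the validators are synchronous only to keep the example short; real layers that call an API or probe SMTP would be asynchronous.

```typescript
type LayerStatus = "valid" | "invalid" | "unverified";

// Hypothetical validator shape for one validation layer.
type Validator = (orgNr: string) => LayerStatus;

// Production stub: deterministic, and honest that nothing was checked.
const notWiredUp: Validator = () => "unverified";

// A record counts as validated only when at least one layer exists and
// every layer explicitly returned "valid".
function isValidated(orgNr: string, layers: Validator[]): boolean {
  return layers.length > 0 && layers.every((layer) => layer(orgNr) === "valid");
}
```

With this shape, a stubbed layer can never set is_validated to true by accident, and the result is the same on every run.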

11. Path 3B — Ephemeral PDF Download + LLM Extraction

When: May 5, 2026 (one-day discovery on discovery/path-3b-pdf-llm-extraction, tagged discovery/path-3b-final)

What: Surface skall-krav, references, SLA, security clearance, staff CV requirements (the bid-decision fields TED structured data does NOT carry) by ephemerally downloading buyer förfrågningsunderlag PDFs (TTL ≤60s) and running structured LLM extraction with verbatim source quotes per field.

Why it failed — three independent dead-ends:

  1. No fetchable PDFs in the SE TED corpus. Host inventory of 670 ingested SE notices: 100% route to login-walled commercial platforms (tendsign.com 436, e-avrop.com 225, kommersannons.se 162, clira.io 72, upphandling.trafikverket.se 52, others 6). e-Avrop confirmed 2-step auth in JS (Logga in med tvåstegsverifiering). BT-15 eForms CallForTendersDocumentReference points back to platform landing pages, not direct PDFs. Hard NO-GO trigger #2 from the legal gate fired.

  2. Buyer-self-hosted subset does not exist at usable volume. Probed buyer org-domains directly to bypass the platform layer (operator pivot 3b). Stopped at N=26/50 once the structural pattern was unambiguous: 0% genuine tender-PDF hit rate against a locked threshold of <10% = abandon. The 3 PDFs found were generic governance docs (Inköpspolicy, Uppförandekod, Upphandlingsplan). 7/26 buyers (27%) explicitly redirect to commercial platforms even when they have a procurement landing page on their own domain.

  3. TF 2 kap 12 § email queue (operator pivot 3c) eliminates real-time intel claim. The PDF is guaranteed allmän handling, but 2-5 business day lead time per request makes it post-award research, not pre-bid intelligence — a different product, parked for long-term roadmap.

QA gates that PASSED before sourcing failed:

  • Technical (Reality Checker, default-NO 9-condition rubric): solvable in principle with verbatim source quotes + temperature=0 + per-field confidence + 100% human verification on 6 fixtures
  • Legal (Compliance Checker): GO-WITH-MITIGATIONS under URL 9 § (offentlighetsprincipen), URL 15 c § (DSM TDM), GDPR Art 6(1)(f), AI Act Art 50

Replaced by: Operator pivot 3a — accept the gap. Position product as “TED intelligence + Layer 2 enrichment,” explicitly NOT bid-decision-support.

Lessons:

  • Probe sourcing BEFORE designing extraction. Both QA gates passed on the unverified assumption that fetchable PDFs existed.
  • Hardened metrics matter. v1 probe showed 40% any-PDF hit rate — looked like a green light. Filtering generic policy/governance PDFs from tender-specific förfrågningsunderlag reversed the signal to 0%.
  • Once 3+ buyers redirect to the same commercial platform, the structural pattern is locked — could have stopped at N=10 with the same conclusion.
  • Pre-articulated hard NO-GO triggers (legal gate trigger #2 = “login-walled PDFs where bypassing requires credential reuse”) are decisive.
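
The hardened metric and the locked abandon threshold can be expressed as a small summary function. This is a sketch under assumptions: it classifies PDFs by title using the three governance document names mentioned above, whereas the source does not say how the real probe separated generic from tender-specific PDFs, and `ProbeResult`, `ProbeSummary`, and `summariseProbe` are hypothetical names rather than the contents of scripts/probe-self-hosted-pdfs.ts.

```typescript
interface ProbeResult {
  buyer: string;
  pdfTitles: string[]; // titles of PDFs found on the buyer's own domain
}

interface ProbeSummary {
  anyPdfRate: number; // share of buyers with at least one PDF of any kind
  genuineRate: number; // share of buyers with at least one non-generic PDF
  decision: "abandon" | "continue";
}

// Governance documents named in the write-up; treated as non-tender PDFs.
const GENERIC_DOC_PATTERNS = [/inköpspolicy/i, /uppförandekod/i, /upphandlingsplan/i];

function isGenericDoc(title: string): boolean {
  return GENERIC_DOC_PATTERNS.some((p) => p.test(title));
}

// abandonBelow mirrors the locked "<10% = abandon" threshold.
function summariseProbe(results: ProbeResult[], abandonBelow = 0.1): ProbeSummary {
  const n = results.length;
  const anyPdf = results.filter((r) => r.pdfTitles.length > 0).length;
  const genuine = results.filter((r) => r.pdfTitles.some((t) => !isGenericDoc(t))).length;
  const genuineRate = n === 0 ? 0 : genuine / n;
  return {
    anyPdfRate: n === 0 ? 0 : anyPdf / n,
    genuineRate,
    decision: genuineRate < abandonBelow ? "abandon" : "continue",
  };
}
```

Reporting both rates side by side makes the gap between "found a PDF" and "found a tender PDF" visible before anyone reads the first number as a green light.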

Artifacts: preserved on tag discovery/path-3b-final: docs/discovery/OUTCOME.md, docs/discovery/PATH_3B_PDF_LLM_EXTRACTION.md, docs/discovery/probe-n50-results.csv, scripts/probe-self-hosted-pdfs.ts.

Related: Technical Debt, Notable Commits, Article 14.