Failed Approaches
Learning from what didn't work
Every failed approach is documented here with context on why it failed and what replaced it.
1. Serper.dev for Domain Discovery
When: Early March 2026
What: Used Serper.dev API for Google search (domain discovery, LinkedIn profiles, news signals)
Why it failed:
- ToS uncertainty — unclear if commercial B2B enrichment is allowed
- Credit exhaustion — costs scaled unpredictably
- Dependency on third-party Google scraping
Replaced by: Free tier approach
- Google Places API (direct, free tier)
- IIS .se zone registry (1.4M domains, PostgreSQL + pg_trgm)
- DNS/HTTP scoring (custom implementation)
Lesson: Free alternatives exist for core functionality. Only pay for premium APIs when free tier quality is proven insufficient.
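The zone-registry lookup above relies on pg_trgm, which scores strings by shared trigrams. As a rough illustration of that idea only (this is not the project's query, and it only approximates PostgreSQL's behaviour), a TypeScript sketch might look like the following. The names `trigrams`, `trigramSimilarity`, and `rankDomains` are hypothetical.

```typescript
// Approximation of pg_trgm-style similarity, for illustration only.
// Assumption: each word is lowercased, padded with two leading spaces and
// one trailing space, then split into 3-character windows.
function trigrams(text: string): Set<string> {
  const grams = new Set<string>();
  const words: string[] = text.toLowerCase().match(/[a-z0-9åäö]+/g) || [];
  words.forEach((word) => {
    const padded = "  " + word + " ";
    for (let i = 0; i + 3 <= padded.length; i++) {
      grams.add(padded.slice(i, i + 3));
    }
  });
  return grams;
}

// Shared trigrams divided by all distinct trigrams across both strings.
function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((gram) => {
    if (tb.has(gram)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Hypothetical helper: order candidate domain labels by similarity to a name.
function rankDomains(companyName: string, labels: string[]): string[] {
  return labels
    .slice()
    .sort((x, y) => trigramSimilarity(companyName, y) - trigramSimilarity(companyName, x));
}
```

In the database itself the same ranking would be done by pg_trgm against an indexed column; the sketch is only meant to show why near-matching domain labels score higher than unrelated ones.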
2. Playwright-Only Scraping
When: March 8-9, 2026 (v4 → v7 transition)
What: Single-browser Playwright extraction for all websites
Why it failed:
- Too slow for multi-page sites (15s timeout per page)
- Missed JS-rendered content on React/Wix sites
- Inline in workers = scaling bottleneck
- Vision fallback (Claude) expensive and unreliable
Replaced by: Multi-extractor strategy
- Crawlee (primary) — multi-page, 6 strategies
- Playwright (fallback) — for JS-heavy sites
- Firecrawl (last resort) — LLM-structured extraction
Lesson: No single extractor handles all site types. Layered approach with fallbacks is necessary.
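The layered approach can be sketched as an ordered chain that falls through to the next extractor when one fails or returns nothing. This is a minimal sketch under assumptions: the `Extractor` interface, `ExtractionResult`, and `extractWithFallback` are hypothetical names, and no real Crawlee, Playwright, or Firecrawl calls are shown.

```typescript
// Hypothetical shapes: the project's real extractor interfaces are not shown here.
interface ExtractionResult {
  extractor: string;
  text: string;
}

interface Extractor {
  name: string;
  extract(url: string): Promise<string | null>; // null = nothing usable found
}

// Try each extractor in priority order; fall through on error or empty output.
async function extractWithFallback(
  url: string,
  extractors: Extractor[]
): Promise<ExtractionResult | null> {
  for (let i = 0; i < extractors.length; i++) {
    const extractor = extractors[i];
    try {
      const text = await extractor.extract(url);
      if (text && text.trim().length > 0) {
        return { extractor: extractor.name, text };
      }
    } catch (err) {
      // A crash or timeout in one layer must not prevent the next layer from running.
    }
  }
  return null; // every layer failed or came back empty
}
```

Passing the extractors as an ordered array keeps the priority (primary, fallback, last resort) in one place and records which layer produced each result, which helps when judging whether the expensive last-resort layer is pulling its weight.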
3. Hunter.io for Email Discovery
When: Early development
What: Used Hunter.io API to find email patterns by domain
Why it failed:
- Expensive at scale
- Limited coverage for Swedish SMEs
- Aggressive rate limiting
Replaced by: SMTP RCPT TO probing
- Generate 4 email patterns from name + domain
- SMTP validate each pattern
- Free, direct, no third-party dependency
Lesson: Protocol-level validation beats API guessing for email verification.
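The pattern-generation step can be sketched as follows. The four patterns chosen here (first.last, first, initial plus last, firstlast) are an assumption for illustration, since which four are actually generated is not listed above. `normalizePart` and `candidateEmails` are hypothetical names, and the SMTP probe itself is not shown.

```typescript
// Assumption: these four patterns are common conventions picked for
// illustration; the real set of four is not listed in this document.
function normalizePart(value: string): string {
  return value
    .trim()
    .toLowerCase()
    .normalize("NFD") // split e.g. "å" into "a" + combining ring
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .replace(/[^a-z0-9]/g, ""); // keep only simple mailbox characters
}

function candidateEmails(firstName: string, lastName: string, domain: string): string[] {
  const first = normalizePart(firstName);
  const last = normalizePart(lastName);
  if (!first || !last) return []; // nothing sensible to generate
  const host = domain.trim().toLowerCase();
  return [
    `${first}.${last}@${host}`,
    `${first}@${host}`,
    `${first.charAt(0)}${last}@${host}`,
    `${first}${last}@${host}`,
  ];
}

// Example: candidateEmails("Åsa", "Lindström", "example.se")
```

Each candidate would then go through the RCPT TO check. Note that some mail servers accept every recipient, so a positive answer from such a server says little about whether the mailbox exists.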
4. Bolagsverket Öppet API for Board Members
When: March 2026
What: Attempted to use Bolagsverket’s open API for board member data
Why it failed:
- IP-blocked by Bolagsverket
- Unreliable availability
- Rate limiting unknown
Replaced by: Bolagsverket VärdefullaDatamängder API (authenticated)
- OAuth2 token-based
- Reliable, official API
- Returns name, SNI, address, trade names, business description
Lesson: Official authenticated APIs beat “open” APIs for production reliability.
5. SCB PxWebApi for Real Company Data
When: March 5, 2026
What: Used SCB PxWebApi v2 to fetch company statistics
Why it failed:
- Returns statistical aggregates, not real companies
- orgNr synthesized from dimension codes
- Name/address are approximations
Current status: Still used for scb_foundations table but with explicit warning in code:
“orgNr/name/address are synthesised from dimension codes, not real company registry records”
Lesson: Statistical APIs ≠ registry APIs. Always verify data source accuracy.
6. Eniro.se for Contact Discovery
When: March 2026
What: Scraped Eniro.se (Swedish Yellow Pages) for contact data
Why it failed:
- ToS prohibits automated extraction
- Site structure changed frequently
- Data quality inconsistent
Replaced by: Direct website scraping + Google Places
- More reliable
- No ToS risk
- Better data quality
Lesson: Don’t build on sources whose ToS prohibit, or are unclear about, automated extraction. Domain blocklist now includes Eniro.
7. allabolag.se / ratsit.se / proff.se
When: Never implemented (correctly blocked early)
What: Considered scraping Swedish business directories
Why blocked:
- Explicit ToS prohibition on commercial extraction
- Legal risk high
- Would violate “legitimate interest” basis
Current status: All 3 domains in INVALID_DOMAINS blocklist (140+ domains total)
Lesson: Legal review before adding any data source. ToS matters for GDPR Art. 6(1)(f).
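A blocklist check of this kind can be sketched as a host comparison done before any fetch. This is a minimal sketch: the four-entry list is a hypothetical excerpt of INVALID_DOMAINS, and `hostnameOf` and `isBlockedSource` are illustrative names rather than the project's functions.

```typescript
// Hypothetical excerpt: the real INVALID_DOMAINS list holds 140+ entries.
const INVALID_DOMAINS: string[] = ["allabolag.se", "ratsit.se", "proff.se", "eniro.se"];

// Pull the host out of an http(s) URL; returns null when the input is not one.
function hostnameOf(url: string): string | null {
  const match = /^https?:\/\/([^\/:?#]+)/i.exec(url.trim());
  return match ? match[1].toLowerCase() : null;
}

// True when the host is a blocked domain or a subdomain of one.
function isBlockedSource(url: string, blocklist: string[] = INVALID_DOMAINS): boolean {
  const host = hostnameOf(url);
  if (host === null) return true; // unparseable input is rejected rather than fetched
  return blocklist.some((blocked) => host === blocked || host.endsWith("." + blocked));
}
```

Matching on the full host or a dot-prefixed suffix blocks `www.allabolag.se` without also blocking an unrelated domain that merely ends in the same letters.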
8. LinkedIn Search Integration
When: March 2026
What: Planned to use LinkedIn for public profile discovery
Why it failed:
- LinkedIn aggressively blocks scraping
- ToS explicitly prohibits automated access
- No reliable API for public profiles
Current status: linkedInSearch() returns an empty array. The code exists but is disabled.
Lesson: Some data sources are simply inaccessible. Don’t waste engineering time on them.
9. Single “God File” Enrichment Engine
When: March 24, 2026
What: enrichmentEngine.v7.ts — 1000+ line monolith
Why it failed:
- Unmaintainable
- No unit test coverage
- Changes risky (affects everything)
Replaced by: Modular decomposition (Phase 2)
- src/enrichment/pipeline.ts: orchestration
- src/enrichment/sources/: data sources
- src/enrichment/processors/: post-processing
- src/extractors/: scraping extractors
Lesson: Decompose early. God files become unmaintainable faster than you think.
10. Mock Validation in Production
When: Ongoing (P0 bug)
What: src/mocks/validation.ts returns Math.random() > 0.5 for all 4 validation layers
Why it’s wrong:
- Imported by production code
- is_validated field is meaningless
- No actual validation happening
Status: Still present. Real validation logic exists (Bolagsverket API, SMTP probes) but not wired up.
Lesson: Mocks should NEVER be imported in production paths. Separate test mocks from production stubs.
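One way to keep mocks out of production paths is to inject the validation layers and make any unwired layer fail loudly instead of returning a random verdict. The sketch below assumes hypothetical names (`CompanyRecord`, `ValidationLayer`, `notImplemented`, `validateRecord`); the project's real validation signatures are not shown in this document.

```typescript
// Hypothetical types: the real record and layer signatures are not shown here.
interface CompanyRecord {
  orgNr: string;
  email: string;
}

type ValidationLayer = (record: CompanyRecord) => Promise<boolean>;

// Production stub for a layer that is not wired up yet: it throws, so a
// missing layer can never silently mark a record as validated.
function notImplemented(layerName: string): ValidationLayer {
  return async () => {
    throw new Error(`Validation layer "${layerName}" is not wired up`);
  };
}

// The caller injects the layers, so this module never imports from src/mocks.
async function validateRecord(
  record: CompanyRecord,
  layers: ValidationLayer[]
): Promise<boolean> {
  for (let i = 0; i < layers.length; i++) {
    const passed = await layers[i](record);
    if (!passed) return false; // the first failing layer decides the outcome
  }
  return true;
}
```

Tests can pass deterministic fake layers into `validateRecord`, while production wiring passes the real checks; a throwing stub turns "not wired up" into a visible error rather than a coin flip stored in is_validated.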
11. Path 3B — Ephemeral PDF Download + LLM Extraction
When: May 5, 2026 (one-day discovery on discovery/path-3b-pdf-llm-extraction, tagged discovery/path-3b-final)
What: Surface skall-krav (mandatory requirements), references, SLA, security clearance, and staff CV requirements (the bid-decision fields that TED structured data does NOT carry) by ephemerally downloading buyer förfrågningsunderlag (tender document) PDFs (TTL ≤60s) and running structured LLM extraction with verbatim source quotes per field.
Why it failed — three independent dead-ends:
- No fetchable PDFs in the SE TED corpus. Host inventory of 670 ingested SE notices: 100% route to login-walled commercial platforms (tendsign.com 436, e-avrop.com 225, kommersannons.se 162, clira.io 72, upphandling.trafikverket.se 52, others 6). e-Avrop confirmed 2-step auth in JS (Logga in med tvåstegsverifiering). BT-15 eForms CallForTendersDocumentReference points back to platform landing pages, not direct PDFs. Hard NO-GO trigger #2 from the legal gate fired.
- Buyer-self-hosted subset does not exist at usable volume. Probed buyer org-domains directly to bypass the platform layer (operator pivot 3b). Stopped at N=26/50 once the structural pattern was unambiguous: 0% genuine tender-PDF hit rate against a locked threshold of <10% = abandon. The 3 PDFs found were generic governance docs (Inköpspolicy, Uppförandekod, Upphandlingsplan). 7/26 buyers (27%) explicitly redirect to commercial platforms even when they have a procurement landing page on their own domain.
- TF 2 kap 12 § email queue (operator pivot 3c) eliminates real-time intel claim. The PDF is guaranteed allmän handling, but 2-5 business day lead time per request makes it post-award research, not pre-bid intelligence: a different product, parked for long-term roadmap.
QA gates that PASSED before sourcing failed:
- Technical (Reality Checker, default-NO 9-condition rubric): solvable in principle with verbatim source quotes + temperature=0 + per-field confidence + 100% human verification on 6 fixtures
- Legal (Compliance Checker): GO-WITH-MITIGATIONS under URL 9 § (offentlighetsprincipen), URL 15 c § (DSM TDM), GDPR Art 6(1)(f), AI Act Art 50
Replaced by: Operator pivot 3a — accept the gap. Position product as “TED intelligence + Layer 2 enrichment,” explicitly NOT bid-decision-support.
Lessons:
- Probe sourcing BEFORE designing extraction. Both QA gates passed on the unverified assumption that fetchable PDFs existed.
- Hardened metrics matter. v1 probe showed 40% any-PDF hit rate — looked like a green light. Filtering generic policy/governance PDFs from tender-specific förfrågningsunderlag reversed the signal to 0%.
- Once 3+ buyers redirect to the same commercial platform, the structural pattern is locked — could have stopped at N=10 with the same conclusion.
- Pre-articulated hard NO-GO triggers (legal gate trigger #2 = “login-walled PDFs where bypassing requires credential reuse”) are decisive.
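The difference between the raw and the hardened metric, together with the locked abandon threshold, can be sketched as below. This is an illustration under assumptions: filtering by PDF title is a guess at how generic documents could be excluded (the probe's actual filter is not described here), and `ProbeResult`, `isTenderSpecific`, `hitRate`, and `decide` are hypothetical names.

```typescript
// Hypothetical probe record: one entry per buyer domain probed.
interface ProbeResult {
  buyerDomain: string;
  pdfTitles: string[];
}

// Generic governance documents named in the findings above.
// Assumption: matching on the title is enough to recognise them.
const GENERIC_DOC_PATTERNS: RegExp[] = [/inköpspolicy/i, /uppförandekod/i, /upphandlingsplan/i];

function isTenderSpecific(title: string): boolean {
  return !GENERIC_DOC_PATTERNS.some((pattern) => pattern.test(title));
}

// Share of probed buyers with at least one PDF that passes the filter.
function hitRate(results: ProbeResult[], filter: (title: string) => boolean): number {
  if (results.length === 0) return 0;
  const hits = results.filter((r) => r.pdfTitles.some(filter)).length;
  return hits / results.length;
}

const ABANDON_BELOW = 0.1; // locked threshold: a hit rate under 10% means abandon

function decide(results: ProbeResult[]): "abandon" | "continue" {
  return hitRate(results, isTenderSpecific) < ABANDON_BELOW ? "abandon" : "continue";
}
```

Calling `hitRate(results, () => true)` gives the any-PDF rate that looked encouraging in the v1 probe, while `hitRate(results, isTenderSpecific)` gives the hardened rate that the decision should rest on.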
Artifacts: preserved on tag discovery/path-3b-final: docs/discovery/OUTCOME.md, docs/discovery/PATH_3B_PDF_LLM_EXTRACTION.md, docs/discovery/probe-n50-results.csv, scripts/probe-self-hosted-pdfs.ts.
Related: Technical Debt, Notable Commits, Article 14.