Scope

2026-04-01 → 2026-04-02. Four commits introduce Crawlee Scraper and the autonomous quality-improvement loop.

The "29 quality-loop rounds" referenced throughout are NOT 29 git commits. They are 29 iterations documented in docs/superpowers/PROGRESS.md in the repo. All 29 were collapsed into two commits on 2026-04-02: 23d7207 and f6c4d30.

Commits

2324c1f — 2026-04-01 — feat(scraper): add Crawlee-based multi-page website scraper

The first Crawlee commit. Substantive new scraper at src/enrichment/sources/crawlee.ts.

  • PlaywrightCrawler with MemoryStorage — no files written to disk.
  • Crawls homepage + contact / about / team sub-pages concurrently (max 8 pages — later raised to 12; see Crawlee Scraper).
  • Text extraction: name-on-own-line plus look-ahead for role / email / phone.
  • Alt-text extraction: images under team/ paths are parsed for alt text of the form “Name role Company”:
    • Handles URL-encoded Next.js image paths (%2Fteam%2F).
    • Filters generic alt words (About, Images, Blog).
    • Accent-normalised dedup (André == Andre).
  • Social link extraction from HTML.
  • Scroll to trigger lazy-loaded content.
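
The alt-text rules above can be illustrated with a minimal sketch. This is not the code in crawlee.ts: the helper names (isTeamImage, isGenericAlt, dedupKey, dedupNames) are hypothetical, the generic-word list contains only the three words cited above, and the "/team/" substring check is an assumption about how the path test works.

```typescript
// Hypothetical helpers illustrating the alt-text rules; names are not from the repo.
const GENERIC_ALT_WORDS = new Set(["about", "images", "blog"]);

/** True if the image URL points into a team/ directory, including URL-encoded paths. */
function isTeamImage(src: string): boolean {
  let decoded = src;
  try {
    decoded = decodeURIComponent(src); // turns %2Fteam%2F into /team/
  } catch {
    // Malformed escape sequence: fall back to the raw string.
  }
  return decoded.toLowerCase().includes("/team/");
}

/** True if the alt text is just a generic word such as "About" or "Blog". */
function isGenericAlt(alt: string): boolean {
  return GENERIC_ALT_WORDS.has(alt.trim().toLowerCase());
}

/** Strip combining accents and lowercase, so "André" and "Andre" share one key. */
function dedupKey(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .trim();
}

/** Keep the first spelling seen for each accent-normalised key. */
function dedupNames(names: string[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const name of names) {
    const key = dedupKey(name);
    if (key === "" || seen.has(key)) continue;
    seen.add(key);
    out.push(name);
  }
  return out;
}
```

For example, dedupNames(["André Berg", "Andre Berg"]) keeps only the first spelling. Note that NFD folding also maps å, ä and ö to a and o, which may or may not be what the real implementation does.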

Wired into website.ts: setting the USE_CRAWLEE=true env flag routes scraping here. USE_FIRECRAWL takes priority if both are set.
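
The flag precedence could look roughly like this. It is a sketch under assumptions: the function name pickScraper, the "default" backend label, and the comparison of USE_FIRECRAWL against the string "true" are all hypothetical; only the two flag names and their priority order come from the commit.

```typescript
// Hypothetical routing helper; the real website.ts may be structured differently.
type ScraperBackend = "firecrawl" | "crawlee" | "default";

function pickScraper(env: Record<string, string | undefined>): ScraperBackend {
  // USE_FIRECRAWL is checked first, so it wins when both flags are set.
  if (env.USE_FIRECRAWL === "true") return "firecrawl";
  if (env.USE_CRAWLEE === "true") return "crawlee";
  return "default";
}
```

In a Node process this would be called as pickScraper(process.env). Taking the environment as a parameter keeps the precedence rule easy to test.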

Config additions:

  • LOCATION_TERMINATING_WORDS: electronics, solutions, technologies, consulting.
  • INVALID_NAME_STANDALONE_WORDS: aktuellt, nyheter, tjänster, produkter, erbjudanden, karriär.
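
As an illustration of how a standalone-word blocklist might be applied, the sketch below rejects a candidate name only when the whole candidate is one of the listed words. That reading of "standalone" is an assumption, as is the helper name isStandaloneNonName. The list holds just the six words added in this commit, not the full config.

```typescript
// The six Swedish navigation words added in this commit (not the full list in config.ts).
const INVALID_NAME_STANDALONE_WORDS = new Set([
  "aktuellt",
  "nyheter",
  "tjänster",
  "produkter",
  "erbjudanden",
  "karriär",
]);

/** Hypothetical check: true if the entire candidate is a blocklisted word. */
function isStandaloneNonName(candidate: string): boolean {
  // NFC keeps "ä" as one code point so it matches the literals above.
  const key = candidate.normalize("NFC").trim().toLowerCase();
  return INVALID_NAME_STANDALONE_WORDS.has(key);
}
```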

cb91206 — 2026-04-02 — autoresearch: add autonomous extraction improvement system

Built the loop that ran the 29 quality rounds.

  • Added JSON-LD structured-data extraction (+18.7 composite score). See JSON-LD Extraction.
  • Extracts schema.org Person, Organization, ContactPoint types.
  • Cited deltas: extraction rate 60% → 80%; false-positive rate 37.5% → 6.3%.
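
A minimal sketch of JSON-LD extraction for the three schema.org types named above follows. It is an illustration only: the regex-based script matching, the JsonLdContact shape, and the function names are assumptions, and the project's actual extractor may parse the DOM instead.

```typescript
// Hypothetical result shape; field names follow schema.org property names.
interface JsonLdContact {
  type: string;
  name: string | undefined;
  jobTitle: string | undefined;
  email: string | undefined;
  telephone: string | undefined;
}

const WANTED_TYPES = new Set(["Person", "Organization", "ContactPoint"]);

function asString(v: unknown): string | undefined {
  return typeof v === "string" ? v : undefined;
}

/** Walk a parsed JSON-LD value and collect nodes whose @type is wanted. */
function collect(node: unknown, out: JsonLdContact[]): void {
  if (Array.isArray(node)) {
    for (const item of node) collect(item, out);
    return;
  }
  if (node === null || typeof node !== "object") return;
  const obj = node as Record<string, unknown>;
  const rawType = obj["@type"];
  const types: unknown[] = Array.isArray(rawType) ? rawType : [rawType];
  for (const t of types) {
    if (typeof t === "string" && WANTED_TYPES.has(t)) {
      out.push({
        type: t,
        name: asString(obj["name"]),
        jobTitle: asString(obj["jobTitle"]),
        email: asString(obj["email"]),
        telephone: asString(obj["telephone"]),
      });
      break;
    }
  }
  // Recurse into nested values such as @graph, employee or contactPoint.
  for (const key of Object.keys(obj)) collect(obj[key], out);
}

/** Find every application/ld+json script block in the HTML and extract contacts. */
function extractJsonLd(html: string): JsonLdContact[] {
  const results: JsonLdContact[] = [];
  const scriptRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptRe.exec(html)) !== null) {
    try {
      collect(JSON.parse(match[1] ?? ""), results);
    } catch {
      // Malformed JSON-LD block: skip it and keep going.
    }
  }
  return results;
}
```

Structured data of this kind is published by the site itself, which is one plausible reason it would reduce false positives compared with free-text heuristics.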

New files:

  • autoresearch/program.md — agent instructions
  • autoresearch/experiment.ts — experiment runner
  • autoresearch/metrics.ts — quality metrics
  • autoresearch/analyze.ts — results analyzer
  • autoresearch/loop.ts — autonomous loop
  • autoresearch/companies.json — test companies
  • autoresearch/README.md — usage

See Autoresearch Loop.

23d7207 — 2026-04-02 — feat: Crawlee quality loop complete — 29 rounds, 145 companies tested

f6c4d30 — 2026-04-02 — feat: Crawlee quality loop complete — 29 rounds, 145 companies tested

23d7207 and f6c4d30 carry the same commit message and body and landed 17 minutes apart. The f6c4d30 diff is deletions only (1356 deletions, 0 additions): it removed dead test scripts, the staged frontend dist/, two overstory specs, .DS_Store, and .claude/settings.local.json. The substantive Crawlee changes landed in 23d7207; f6c4d30 is the cleanup.

Cited outcomes (body):

  • 40+ fixes applied across domain discovery, Crawlee, contact extraction.
  • 0 false positives in contact extraction — every contact is a real person.
  • ~60% contact extraction rate for companies with live websites.
  • Policy-compliant: robots.txt respected, public data only, no TOS violations.
  • Serper removed from fallbacks (TOS concerns + credits exhausted).
  • Policy-compliant fallbacks now: BV Öppet API + VärdefullaDatamängder only.
  • 82/82 unit tests passing.
  • Body claim: “production-ready for bulk enrichment of ~650K active AB companies.”

Files modified (per commit message):

  • src/enrichment/sources/crawlee.ts — SSL fix, www / non-www handling, purgeOnStart, blocklists.
  • src/enrichment/sources/domain.ts — parked-page detection, generic-domain rejection, city context.
  • src/enrichment/config.ts — 790 lines of blocklists.
  • src/enrichment/processors/nameUtils.ts — Swedish name validation, role normalization.
  • test-loop-crawlee.ts — iterative quality loop test with fallback.
  • docs/superpowers/PROGRESS.md — the 29 rounds documented.
  • docs/superpowers/FINAL_REPORT.md — readiness assessment.
  • docs/superpowers/CRAWLEE_RESEARCH_PROGRAM.md — research program doc.

The body's claims go beyond what can be verified: the message reports 0 false positives across 145 tested companies, but the metric actually optimised was the autoresearch composite score. See Experiment Results for the round-by-round numbers.

Significance

  • Crawlee replaced Playwright as the primary scraper.
  • Serper was removed from the fallback chain for TOS reasons. The current scraper relies on direct site crawl + the two BV APIs.
  • The autoresearch loop is the project’s first autonomous improvement system. It still runs from autoresearch/; its test runs land in autoresearch/results/ and are untracked.

See also

Crawlee Scraper, Autoresearch Loop, JSON-LD Extraction, Experiment Results, History Overview.
