Scope
2026-04-01 → 2026-04-02. Four commits introduce Crawlee Scraper and the autonomous quality-improvement loop.
The "29 quality-loop rounds" referenced everywhere are NOT 29 git commits. They are 29 documented iterations recorded in docs/superpowers/PROGRESS.md (in the repo). All 29 were collapsed into the two f6c4d30 / 23d7207 commits on 2026-04-02.
Commits
2324c1f — 2026-04-01 — feat(scraper): add Crawlee-based multi-page website scraper
The first Crawlee commit. Substantive new scraper at src/enrichment/sources/crawlee.ts.
- PlaywrightCrawler with MemoryStorage; no files are written to disk.
- Crawls homepage + contact / about / team sub-pages concurrently (max 8 pages; later raised to 12, see Crawlee Scraper).
- Text extraction: name-on-own-line plus look-ahead for role / email / phone.
- Alt-text extraction: team/ path images parsed for “Name role Company” alt:
  - Handles URL-encoded Next.js image paths (%2Fteam%2F).
  - Filters generic alt words (About, Images, Blog).
  - Accent-normalised dedup (André == Andre).
- Social link extraction from HTML.
- Scroll to trigger lazy-loaded content.
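The URL-decoding and accent-normalised dedup steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in crawlee.ts: the function names isTeamImagePath, dedupKey, and dedupeNames are hypothetical, and the exact matching rules in the real scraper may differ.

```typescript
// Hypothetical helper: true when an image src points into a team/ path,
// including URL-encoded Next.js image URLs such as "?url=%2Fteam%2F...".
function isTeamImagePath(src: string): boolean {
  let decoded = src;
  try {
    decoded = decodeURIComponent(src);
  } catch {
    // Malformed percent-encoding: fall back to the raw string.
  }
  return decoded.toLowerCase().includes("/team/");
}

// Hypothetical helper: build a comparison key in which "André" and "Andre"
// collapse to the same value. NFD splits a letter from its combining
// diacritic, and the regex then removes the combining marks.
// Note: this also folds Swedish å/ä/ö to a/a/o; whether the real scraper
// does the same is an assumption.
function dedupKey(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .trim();
}

// Keep the first spelling seen for each normalised key.
function dedupeNames(names: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const name of names) {
    const key = dedupKey(name);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(name);
    }
  }
  return unique;
}
```

Keeping the first-seen spelling is a design choice made here for simplicity; a real implementation might instead prefer the accented form.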
Wired into website.ts: the USE_CRAWLEE=true env flag routes scraping here. USE_FIRECRAWL takes priority if both are set.
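The flag precedence described here can be sketched as below. The selectScraper function, the ScraperChoice type, and the "default" branch are hypothetical names for illustration; only the two flag names and their priority come from the commit. The sketch also assumes USE_FIRECRAWL is compared against "true" in the same way as USE_CRAWLEE.

```typescript
// Hypothetical labels for the scraper paths; "default" stands in for
// whatever website.ts does when neither flag is set.
type ScraperChoice = "firecrawl" | "crawlee" | "default";

// Takes the environment as a plain record so the sketch does not depend
// on Node typings; in practice this would likely be process.env.
function selectScraper(env: Record<string, string | undefined>): ScraperChoice {
  // USE_FIRECRAWL wins when both flags are set.
  if (env["USE_FIRECRAWL"] === "true") return "firecrawl";
  if (env["USE_CRAWLEE"] === "true") return "crawlee";
  return "default";
}
```

Checking the higher-priority flag first keeps the precedence rule in one place rather than spreading it across call sites.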
Config additions:
- LOCATION_TERMINATING_WORDS: electronics, solutions, technologies, consulting.
- INVALID_NAME_STANDALONE_WORDS: aktuellt, nyheter, tjänster, produkter, erbjudanden, karriär.
cb91206 — 2026-04-02 — autoresearch: add autonomous extraction improvement system
Built the loop that ran the 29 quality rounds.
- Added JSON-LD structured-data extraction (+18.7 composite score). See JSON-LD Extraction.
- Extracts schema.org Person, Organization, ContactPoint types.
- Cited deltas: extraction rate 60% → 80%; false-positive rate 37.5% → 6.3%.
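As a rough illustration of the JSON-LD step above, the sketch below pulls application/ld+json script blocks out of raw HTML and collects nodes whose @type is Person, Organization, or ContactPoint. It is not the project's implementation: extractJsonLd, collectNodes, and JsonLdNode are hypothetical names, and the regex-based script matching is a simplification chosen to keep the example dependency-free.

```typescript
type JsonLdNode = Record<string, unknown>;

// The three schema.org types named in the commit.
const WANTED_TYPES = new Set(["Person", "Organization", "ContactPoint"]);

// Walk any parsed JSON value and collect objects with a wanted @type.
// Recursion covers arrays, @graph containers, and nested properties
// such as contactPoint or employee.
function collectNodes(value: unknown, out: JsonLdNode[]): void {
  if (Array.isArray(value)) {
    for (const item of value) collectNodes(item, out);
    return;
  }
  if (value === null || typeof value !== "object") return;

  const node = value as JsonLdNode;
  const rawType = node["@type"];
  // @type may be a single string or an array of strings.
  const types: unknown[] = Array.isArray(rawType) ? rawType : [rawType];
  if (types.some((t) => typeof t === "string" && WANTED_TYPES.has(t))) {
    out.push(node);
  }
  for (const key of Object.keys(node)) collectNodes(node[key], out);
}

// Find every <script type="application/ld+json"> block and parse it.
function extractJsonLd(html: string): JsonLdNode[] {
  const scriptRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const out: JsonLdNode[] = [];
  let match: RegExpExecArray | null;
  while ((match = scriptRe.exec(html)) !== null) {
    try {
      collectNodes(JSON.parse(match[1] ?? ""), out);
    } catch {
      // Malformed JSON-LD is skipped rather than failing the whole page.
    }
  }
  return out;
}
```

Because structured data is published by the site itself, nodes found this way need less heuristic filtering than names scraped from free text, which is consistent with the drop in false-positive rate cited above.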
New files:
- autoresearch/program.md: agent instructions
- autoresearch/experiment.ts: experiment runner
- autoresearch/metrics.ts: quality metrics
- autoresearch/analyze.ts: results analyzer
- autoresearch/loop.ts: autonomous loop
- autoresearch/companies.json: test companies
- autoresearch/README.md: usage
See Autoresearch Loop.
23d7207 — 2026-04-02 — feat: Crawlee quality loop complete — 29 rounds, 145 companies tested
f6c4d30 — 2026-04-02 — feat: Crawlee quality loop complete — 29 rounds, 145 companies tested
23d7207 and f6c4d30 are duplicate-message commits 17 minutes apart. Same body. Net-of-deletions diff (f6c4d30): 1356 deletions, 0 additions — it removed dead test scripts, the staged frontend dist/, two overstory specs, .DS_Store, and .claude/settings.local.json. The substantive Crawlee changes landed in 23d7207; f6c4d30 is the cleanup.
Cited outcomes (body):
- 40+ fixes applied across domain discovery, Crawlee, contact extraction.
- 0 false positives in contact extraction — every contact is a real person.
- ~60% contact extraction rate for companies with live websites.
- Policy-compliant: robots.txt respected, public data only, no TOS violations.
- Serper removed from fallbacks (TOS concerns + credits exhausted).
- Policy-compliant fallbacks now: BV Öppet API + VärdefullaDatamängder only.
- 82/82 unit tests passing.
- Body claim: “production-ready for bulk enrichment of ~650K active AB companies.”
Files modified (per commit message):
- src/enrichment/sources/crawlee.ts: SSL fix, www / non-www handling, purgeOnStart, blocklists.
- src/enrichment/sources/domain.ts: parked-page detection, generic-domain rejection, city context.
- src/enrichment/config.ts: 790 lines of blocklists.
- src/enrichment/processors/nameUtils.ts: Swedish name validation, role normalization.
- test-loop-crawlee.ts: iterative quality loop test with fallback.
- docs/superpowers/PROGRESS.md: the 29 rounds documented.
- docs/superpowers/FINAL_REPORT.md: readiness assessment.
- docs/superpowers/CRAWLEE_RESEARCH_PROGRAM.md: research program doc.
Body claims and verifiable claims diverge — the message says 0 false positives across 145 tested companies, but the autoresearch composite metric is what was actually optimised. See Experiment Results for the actual round-by-round numbers.
Significance
- Crawlee replaced Playwright as the primary scraper.
- Serper was removed from the fallback chain for TOS reasons. The current scraper relies on direct site crawl + the two BV APIs.
- The autoresearch loop is the project’s first autonomous improvement system. It still runs from autoresearch/ (untracked test runs in autoresearch/results/).
See also
Crawlee Scraper, Autoresearch Loop, JSON-LD Extraction, Experiment Results, History Overview.