History Crawlee Era

Scope

2026-04-01 → 2026-04-02. Four commits introduce Crawlee Scraper and the autonomous quality-improvement loop.

The "29 quality-loop rounds" referenced everywhere are NOT 29 git commits. They are 29 documented iterations recorded in docs/superpowers/PROGRESS.md (in the repo). All 29 were collapsed into two commits, 23d7207 and f6c4d30, on 2026-04-02.

Commits

2324c1f — 2026-04-01 — feat(scraper): add Crawlee-based multi-page website scraper

The first Crawlee commit. Substantive new scraper at src/enrichment/sources/crawlee.ts.

PlaywrightCrawler with MemoryStorage — no files written to disk.
Crawls homepage + contact / about / team sub-pages concurrently (max 8 pages — later raised to 12; see Crawlee Scraper).
Text extraction: name-on-own-line plus look-ahead for role / email / phone.
Alt-text extraction: images under a team/ path are parsed for alt text of the form “Name role Company”:
- Handles URL-encoded Next.js image paths (%2Fteam%2F).
- Filters generic alt words (About, Images, Blog).
- Accent-normalised dedup (André == Andre).
Social link extraction from HTML.
Scroll to trigger lazy-loaded content.
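
The alt-text rules above can be sketched in a few lines. This is a minimal illustration, not the code in crawlee.ts: the names `TeamImage`, `dedupKey`, `isTeamImage`, and `altTextCandidates` are hypothetical, and treating an alt text as generic only when it is exactly one blocked word is an assumption. It returns the surviving alt strings and does not attempt to split them into name, role, and company.

```typescript
// Hypothetical sketch of the alt-text filter described above.
interface TeamImage {
  src: string;
  alt: string;
}

// The three generic words named in the commit notes; the real list may be longer.
const GENERIC_ALT_WORDS = new Set(["about", "images", "blog"]);

// Fold accents and case so "André" and "Andre" produce the same key.
function dedupKey(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .trim();
}

// True when the (possibly URL-encoded) image path contains a team/ segment.
function isTeamImage(src: string): boolean {
  let decoded = src;
  try {
    decoded = decodeURIComponent(src); // turns %2Fteam%2F into /team/
  } catch {
    // Malformed escape sequence: fall back to the raw string.
  }
  return decoded.toLowerCase().includes("/team/");
}

function altTextCandidates(images: TeamImage[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const { src, alt } of images) {
    const text = alt.trim();
    if (!isTeamImage(src) || text === "") continue;
    if (GENERIC_ALT_WORDS.has(text.toLowerCase())) continue;
    const key = dedupKey(text);
    if (seen.has(key)) continue; // accent-insensitive duplicate
    seen.add(key);
    result.push(text);
  }
  return result;
}
```

Decoding the path before matching is what lets a Next.js optimiser URL such as `/_next/image?url=%2Fteam%2Fanna.jpg` count as a team image.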

Wired into website.ts: the USE_CRAWLEE=true env flag routes scraping here. USE_FIRECRAWL takes priority if both are set.
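
The flag precedence can be illustrated with a small selector. This is a sketch under assumptions: the function name `pickScraper`, the `"default"` fallback label, and the idea that USE_FIRECRAWL is also compared against the string "true" are not confirmed by the commit notes.

```typescript
// Hypothetical sketch of the routing rule; website.ts may differ.
type ScraperChoice = "firecrawl" | "crawlee" | "default";

function pickScraper(env: Record<string, string | undefined>): ScraperChoice {
  // USE_FIRECRAWL is checked first, so it wins when both flags are set.
  if (env.USE_FIRECRAWL === "true") return "firecrawl";
  if (env.USE_CRAWLEE === "true") return "crawlee";
  // Placeholder for whatever scraper runs when neither flag is set.
  return "default";
}
```

In a Node process this would be called as `pickScraper(process.env)`.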

Config additions:

LOCATION_TERMINATING_WORDS: electronics, solutions, technologies, consulting.
INVALID_NAME_STANDALONE_WORDS: aktuellt, nyheter, tjänster, produkter, erbjudanden, karriär.
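
To show how a standalone-word blocklist could interact with the name-on-own-line heuristic, here is a minimal sketch. The word list is copied from the config line above; everything else (the `extractContacts` function, the regexes, and the three-line look-ahead window) is an assumption made for illustration and not the project's implementation.

```typescript
// Values copied from the config additions listed above.
const INVALID_NAME_STANDALONE_WORDS = new Set([
  "aktuellt", "nyheter", "tjänster", "produkter", "erbjudanden", "karriär",
]);

interface ContactCandidate {
  name: string;
  role?: string;
  email?: string;
  phone?: string;
}

// Assumed heuristic: one to four title-cased words on a line look like a name.
const NAME_WORD = "\\p{Lu}\\p{Ll}+(?:-\\p{Lu}\\p{Ll}+)*";
const NAME_LINE = new RegExp(`^${NAME_WORD}(?: ${NAME_WORD}){0,3}$`, "u");
const EMAIL = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/;
const PHONE = /\+?\d[\d\s-]{6,}\d/;

function extractContacts(text: string, lookAhead = 3): ContactCandidate[] {
  const lines = text
    .split(/\r?\n/)
    .map((l) => l.trim())
    .filter((l) => l !== "");
  const contacts: ContactCandidate[] = [];
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    // Navigation words such as "Nyheter" look like names but are not.
    if (INVALID_NAME_STANDALONE_WORDS.has(line.toLowerCase())) continue;
    if (!NAME_LINE.test(line)) continue;
    const candidate: ContactCandidate = { name: line };
    // Look ahead a few lines for the details that belong to this name.
    for (const next of lines.slice(i + 1, i + 1 + lookAhead)) {
      if (NAME_LINE.test(next)) break; // the next person starts here
      const email = next.match(EMAIL)?.[0];
      const phone = next.match(PHONE)?.[0];
      if (email) candidate.email = candidate.email ?? email;
      else if (phone) candidate.phone = candidate.phone ?? phone;
      else candidate.role = candidate.role ?? next;
    }
    contacts.push(candidate);
  }
  return contacts;
}
```

A heuristic this loose will still accept other title-cased headings, which is presumably why the real blocklists grew as the quality rounds progressed.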

cb91206 — 2026-04-02 — autoresearch: add autonomous extraction improvement system

Built the loop that ran the 29 quality rounds.

Added JSON-LD structured-data extraction (+18.7 composite score). See JSON-LD Extraction.
Extracts schema.org Person, Organization, ContactPoint types.
Cited deltas: extraction rate 60% → 80%; false-positive rate 37.5% → 6.3%.
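
For readers unfamiliar with JSON-LD, the extraction step can be sketched as follows. This is an illustrative sketch, not the project's extractor: the function name `extractJsonLdNodes`, the regex-based script scan, and the recursive walk are assumptions. Only the three schema.org types come from the commit notes.

```typescript
// Hypothetical sketch: collect schema.org nodes of the three cited types.
type JsonLdNode = { "@type"?: string | string[]; [key: string]: unknown };

const WANTED_TYPES = new Set(["Person", "Organization", "ContactPoint"]);

function extractJsonLdNodes(html: string): JsonLdNode[] {
  const found: JsonLdNode[] = [];
  const scriptRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;

  const visit = (value: unknown): void => {
    if (Array.isArray(value)) {
      value.forEach(visit);
      return;
    }
    if (value === null || typeof value !== "object") return;
    const node = value as JsonLdNode;
    const rawType = node["@type"];
    const types = Array.isArray(rawType) ? rawType : [rawType];
    if (types.some((t) => typeof t === "string" && WANTED_TYPES.has(t))) {
      found.push(node);
    }
    // Recurse so nested nodes (a ContactPoint inside an Organization,
    // or entries under @graph) are collected too.
    Object.values(node).forEach(visit);
  };

  let match: RegExpExecArray | null;
  while ((match = scriptRe.exec(html)) !== null) {
    try {
      visit(JSON.parse(match[1]));
    } catch {
      // Skip script blocks that do not contain valid JSON.
    }
  }
  return found;
}
```

Structured data is declared by the site itself, which is one plausible reason it would lower the false-positive rate compared with free-text heuristics.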

New files:

autoresearch/program.md — agent instructions
autoresearch/experiment.ts — experiment runner
autoresearch/metrics.ts — quality metrics
autoresearch/analyze.ts — results analyzer
autoresearch/loop.ts — autonomous loop
autoresearch/companies.json — test companies
autoresearch/README.md — usage

See Autoresearch Loop.

23d7207 — 2026-04-02 — feat: Crawlee quality loop complete — 29 rounds, 145 companies tested

f6c4d30 — 2026-04-02 — feat: Crawlee quality loop complete — 29 rounds, 145 companies tested

23d7207 and f6c4d30 carry the same message and body and were committed 17 minutes apart. The substantive Crawlee changes landed in 23d7207; f6c4d30 is the cleanup. Its diff is deletions only (1356 deletions, 0 additions): it removed dead test scripts, the staged frontend dist/, two overstory specs, .DS_Store, and .claude/settings.local.json.

Cited outcomes (body):

40+ fixes applied across domain discovery, Crawlee, contact extraction.
0 false positives in contact extraction — every contact is a real person.
~60% contact extraction rate for companies with live websites.
Policy-compliant: robots.txt respected, public data only, no TOS violations.
Serper removed from fallbacks (TOS concerns + credits exhausted).
Policy-compliant fallbacks now: BV Öppet API + VärdefullaDatamängder only.
82/82 unit tests passing.
Body claim: “production-ready for bulk enrichment of ~650K active AB companies.”

Files modified (per commit message):

src/enrichment/sources/crawlee.ts — SSL fix, www / non-www handling, purgeOnStart, blocklists.
src/enrichment/sources/domain.ts — parked-page detection, generic-domain rejection, city context.
src/enrichment/config.ts — 790 lines of blocklists.
src/enrichment/processors/nameUtils.ts — Swedish name validation, role normalization.
test-loop-crawlee.ts — iterative quality loop test with fallback.
docs/superpowers/PROGRESS.md — the 29 rounds documented.
docs/superpowers/FINAL_REPORT.md — readiness assessment.
docs/superpowers/CRAWLEE_RESEARCH_PROGRAM.md — research program doc.
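
The commit message does not say what the www / non-www handling in crawlee.ts consists of. One common approach, shown here purely as an assumption, is to try both host variants in turn. The function name `hostVariants` and the choice to always emit https URLs are hypothetical.

```typescript
// Hypothetical sketch: produce both host variants for a domain, keeping
// the variant that was supplied first in the list.
function hostVariants(input: string): string[] {
  const host = input
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, "") // drop the scheme, if any
    .replace(/[/?#].*$/, ""); // drop path, query, and fragment
  const bare = host.startsWith("www.") ? host.slice(4) : host;
  const withWww = `www.${bare}`;
  const ordered = host === bare ? [bare, withWww] : [withWww, bare];
  return ordered.map((h) => `https://${h}/`);
}
```

A caller could request each URL in order and stop at the first one that responds.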

Body claims and verifiable claims diverge — the message says 0 false positives across 145 tested companies, but the autoresearch composite metric is what was actually optimised. See Experiment Results for the actual round-by-round numbers.

Significance

Crawlee replaced Playwright as the primary scraper. Playwright is still the browser layer underneath, driven through Crawlee's PlaywrightCrawler.
Serper was removed from the fallback chain for TOS reasons. The current scraper relies on direct site crawl + the two BV APIs.
The autoresearch loop is the project’s first autonomous improvement system. It still runs from autoresearch/ (untracked test runs in autoresearch/results/).
