# Agentic Deep-Description Analysis Plan

**Goal:** From the 5.4 MB `skool_about_full.json` (999 communities × full lp_description, owner_bio, attachments, links, retention videos), surface **2–4 communities worth joining and scraping** — based on positioning depth, social proof claims, pricing tiers, and "how to get in" signals.

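**Why we can't just LLM the whole thing in one shot:** 5.4 MB ≈ 1.4M tokens (at roughly 4 bytes per token). That is past the context window of most models, and even Gemini 3.1 Pro (2M context) would have little headroom left for the prompt and output, with per-community detail likely to get lost in a single pass. Solution: **map → reduce** with two model tiers in parallel, deterministic pre-filters first.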
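The size estimate can be reproduced with a quick calculation. This is a rough sketch: the 4-bytes-per-token ratio is a rule of thumb assumed here, not a tokenizer measurement, and `estimate_tokens` is a hypothetical helper name.

```python
import math

def estimate_tokens(n_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Rough token count from file size; ~4 bytes/token is an assumed rule of thumb."""
    return math.ceil(n_bytes / bytes_per_token)

FILE_BYTES = 5_400_000   # 5.4 MB, from the goal statement
N_COMMUNITIES = 999      # from the goal statement

total_tokens = estimate_tokens(FILE_BYTES)
per_community = total_tokens // N_COMMUNITIES  # average only; long descriptions skew higher

print(total_tokens, per_community)
```

The per-community figure is an average across the whole file, so the top-100 set with long `lp_description` fields will probably run above it.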

---

## Stage 0 — Deterministic pre-filter (cheap, instant, zero LLM)

Score every community 0-100 on these signals (all directly in JSON):

| Signal | Weight | Rationale |
|---|---|---|
| `current_price_usd_mo` ≥ $50 | 20 | filter premium plays |
| Members between 200–5000 | 15 | not too small, not saturated |
| Posts/member ≥ 5 | 15 | active community |
| `lp_description` ≥ 800 chars | 10 | owner invested in copy |
| Has `retention_video_url` | 10 | sales-funnel mature |
| `plugin_zapier` = true | 5 | operationally sophisticated |
| `afl_pct` between 30-50 | 10 | easy to refer-in (free entry via affiliate?) |
| `display_price < current_price` (raised) | 15 | momentum / proof of pricing power |

**Output:** `data/candidates_ranked.csv`, 999 rows (one per community) sorted by score, highest first.

Cutoff: top 100. That's the LLM-analysis set.
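A minimal sketch of the Stage 0 scorer, using the weights from the table. The field names `members`, `posts`, and `display_price` are assumptions about the JSON schema (the table only names some keys), and `current_price` in the last row is assumed to be the same value as `current_price_usd_mo`.

```python
from typing import Any, Mapping, Sequence

def score_community(c: Mapping[str, Any]) -> int:
    """Return a 0-100 score; each satisfied signal adds its table weight."""
    price = c.get("current_price_usd_mo") or 0
    display = c.get("display_price")      # assumed key for the listed price
    members = c.get("members") or 0       # assumed key
    posts = c.get("posts") or 0           # assumed key
    afl = c.get("afl_pct") or 0

    score = 0
    if price >= 50:
        score += 20                       # premium play
    if 200 <= members <= 5000:
        score += 15                       # not too small, not saturated
    if members and posts / members >= 5:
        score += 15                       # active community
    if len(c.get("lp_description") or "") >= 800:
        score += 10                       # owner invested in copy
    if c.get("retention_video_url"):
        score += 10                       # mature sales funnel
    if c.get("plugin_zapier") is True:
        score += 5                        # operationally sophisticated
    if 30 <= afl <= 50:
        score += 10                       # affiliate entry plausible
    if display is not None and display < price:
        score += 15                       # price was raised
    return score

def top_candidates(communities: Sequence[Mapping[str, Any]], n: int = 100) -> list:
    """Sort by score, highest first, and keep the first n."""
    return sorted(communities, key=score_community, reverse=True)[:n]
```

Missing or null fields score zero for that signal rather than raising, so partially scraped records still get ranked.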

---

## Stage 1 — Map phase (parallel, 100 communities, ~5 batches of 20)

Send each batch of 20 to **GPT-5.5** (long context, cheap) with prompt:

```
For each community below, extract structured analysis:
{
  slug, positioning_promise (1 sentence), social_proof_claims (list of $/users/results),
  target_avatar (1 sentence), price_tiers_with_what_you_get (JSON),
  guaranteed_outcomes (list), discord_or_zoom_mentioned (bool),
  free_entry_tricks (any mention of trial/affiliate/refund-on-cancel?),
  who_should_join (1 sentence), red_flags (list — e.g. vague claims, no proof),
  copying_difficulty (1-10), one_thing_competitor_could_steal (1 sentence)
}
Return JSON array. Be brutal — flag empty hype as red_flag.
```

Input per community = `slug, name, members, price, lp_description (full), owner_bio, external_links, survey_questions, retention_video_url`.

Output: `data/llm_per_community.jsonl` — 100 rich analyses.
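Model output will not always match the requested schema, so it may be worth validating each record before writing the JSONL file. The sketch below checks only for key presence; the key list is taken from the prompt above, and the helper names `missing_keys` and `to_jsonl` are hypothetical.

```python
import json

REQUIRED_KEYS = (
    "slug", "positioning_promise", "social_proof_claims", "target_avatar",
    "price_tiers_with_what_you_get", "guaranteed_outcomes",
    "discord_or_zoom_mentioned", "free_entry_tricks", "who_should_join",
    "red_flags", "copying_difficulty", "one_thing_competitor_could_steal",
)

def missing_keys(record: dict) -> list:
    """Return the prompt-schema keys that are absent from one analysis record."""
    return [k for k in REQUIRED_KEYS if k not in record]

def to_jsonl(records: list) -> str:
    """Serialize records as one JSON object per line; reject incomplete ones."""
    lines = []
    for r in records:
        gaps = missing_keys(r)
        if gaps:
            raise ValueError(f"{r.get('slug', '?')}: missing {gaps}")
        lines.append(json.dumps(r, ensure_ascii=False) + "\n")
    return "".join(lines)
```

Failing loudly on an incomplete record makes it easy to re-send just that batch instead of discovering gaps at Stage 2.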

**Parallel dispatch:** 5 OpenAI calls fan-out, ~30s total.
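The batching and fan-out can be sketched with the standard library alone. `analyze_batch` is a placeholder for whatever function wraps the model call (not shown, and not a real client API); `fake_analyze` stands in for it so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

def make_batches(items: Sequence[dict], size: int = 20) -> List[List[dict]]:
    """Split the candidate list into consecutive batches of `size`."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]

def fan_out(batches: List[List[dict]],
            analyze_batch: Callable[[List[dict]], List[dict]],
            max_workers: int = 5) -> List[dict]:
    """Run one call per batch concurrently and flatten results in batch order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(analyze_batch, batches)  # map preserves input order
        return [row for batch in results for row in batch]

def fake_analyze(batch: List[dict]) -> List[dict]:
    """Stand-in for the real model call: echoes each slug back."""
    return [{"slug": c["slug"]} for c in batch]

candidates = [{"slug": f"community-{i}"} for i in range(100)]
batches = make_batches(candidates)
analyses = fan_out(batches, fake_analyze)
print(len(batches), len(analyses))
```

Threads suit this step because the work is network-bound; an exception in any batch surfaces when its result is consumed, so a failed call stops the run rather than silently dropping 20 communities.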

---

## Stage 2 — Reduce phase (synthesis across the 100)

Feed all 100 structured analyses (now ~150KB JSON, fits easily) to:

**A) Gemini 3.1 Pro** — "Pick the 4 most worth joining as a market researcher. Rank by:
- depth of positioning + proof
- how easy to learn-from-and-clone for SaaC / blue-ocean plays
- whether free entry is plausible (50% affiliate link, free trial, refund window)
- whether the owner is approachable (1:1 mentions, Discord access, location)

Justify each pick with 3 quotes from their lp_description. Recommend specific scrape targets per community (e.g. 'classroom modules 1-5', 'recent 100 posts')."

**B) GPT-5.5** in parallel — same prompt, independent pick. Look for overlap (high signal) and divergence (worth investigating both opinions).

**C) Stats backstop** — print top 10 by score with their LLM-extracted "one_thing_competitor_could_steal" column, so we have a third anchor.
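Comparing the two independent pick lists reduces to set operations. A minimal sketch, assuming each model's picks have already been parsed into a list of slugs; `compare_picks` is a hypothetical helper and all slugs other than `evolve-8484` are made up for illustration.

```python
from typing import Dict, Iterable, List

def compare_picks(gemini_picks: Iterable[str],
                  gpt_picks: Iterable[str]) -> Dict[str, List[str]]:
    """Split two pick lists into agreed picks and picks unique to each model."""
    a, b = set(gemini_picks), set(gpt_picks)
    return {
        "overlap": sorted(a & b),       # both models agree: high signal
        "gemini_only": sorted(a - b),   # divergence: review manually
        "gpt_only": sorted(b - a),
    }

result = compare_picks(
    ["evolve-8484", "community-a", "community-b", "community-c"],  # illustrative
    ["evolve-8484", "community-b", "community-d", "community-e"],  # illustrative
)
print(result)
```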

---

## Stage 3 — Free-entry research (per candidate)

For each of the 4 picks, deterministic probes:

1. **Affiliate link check:** if `aflPercent` ≥ 30, search "[community name] affiliate" / "join free trial" via Firecrawl.
2. **Trial check:** the plan assumes most paid Skool groups offer a 14-day free trial; confirm per group via GET on `/about` (look for the string "free trial").
3. **Refund window:** the plan assumes Skool's default is a 14-day, no-questions-asked refund; confirm this against Skool's current terms and the group's own policy before paying. If it holds, **join → scrape → refund within 14 days = $0 cost**.
4. **Owner approachability:** if the Stage 1 analysis has `discord_or_zoom_mentioned` = true and `owner_location` puts the owner in our time zone, a DM or cold email may work better than paying.

Output: `data/free_entry_strategies.md`, with one section per candidate.

---

## Stage 4 — Scrape execution plan

For each of the 2–4 chosen communities, output a per-target SOP:

```
Target: evolve-8484
Cost: $1700 first month → refund day 13 = $0
Scrape priorities:
  - All 27 courses' module titles + first lesson of each
  - Last 200 posts (search trending: "scaled past", "$100k/day", "what's working")
  - Member-list snapshot (geo + bio if visible)
  - Pinned posts + any "start here" thread
Tools:
  - Skool POSTs require session cookie → store in .env.SKOOL_SESSION_evolve8484
  - Apify "skool-community-scraper" actor (if exists) OR custom Playwright
  - Run scrape day 11-12, refund day 13
Risk: account flag if scrape >2 req/sec → throttle 5s between requests
```

---

## Cost budget

| Stage | Tool | Calls | Tokens | $ |
|---|---|---|---|---|
| 0 | Python (deterministic) | 0 | 0 | $0 |
| 1 | GPT-5.5 | 5 batches | ~200K in / ~30K out | ~$2 |
| 2A | Gemini 3.1 Pro | 1 | 150K in / 5K out | ~$1 |
| 2B | GPT-5.5 | 1 | 150K in / 5K out | ~$1 |
| 3 | Firecrawl + HTTP | 4×3 | minimal | $0 (free tier) |
| 4 | Apify community scraper | 4 runs | — | ~$5 |
| **Total LLM** | | | | **~$4** |
| **Total infra** | | | | **~$5** |
| Refundable Skool joins | 4 × ($50-$1700) | refund day 13 | | **$0** |

---

## Execution order

```
python analyze_descriptions_stage0.py   # deterministic score → candidates_ranked.csv
python analyze_descriptions_stage1.py   # 5 parallel GPT-5.5 calls → llm_per_community.jsonl
python analyze_descriptions_stage2.py   # 2 parallel calls (Gemini + GPT) → picks.json
python analyze_descriptions_stage3.py   # Firecrawl probes → free_entry_strategies.md
# Stage 4 is manual review + Apify scraper config
```

**Total runtime, Stages 0–3:** ~3 minutes thanks to parallel dispatch, runnable as a single command. Stage 4 is manual and not included in that figure.

---

## Why this beats "just dump it all to an LLM"

1. **Zero token waste** on the ~900 communities cut at Stage 0, most of them obvious losers ($0-9 tier, <100 members, dead). The pre-filter handles them deterministically.
2. **Two-model triangulation** at Stage 2 guards against single-model bias (working hypothesis, not yet measured: Gemini overrates polish, GPT overrates revenue claims).
3. **Free-entry stage is decoupled** — if a community looks good but has no trial/refund/affiliate path, we drop it before paying.
4. **Refundable test:** if Skool's 14-day refund holds, cost per scrape ≈ $0 when executed cleanly. The worst case is a denied refund, which costs the full first-month price.

---

## Open questions for you

1. **Refund ethics:** joining + scraping + refunding is a gray area and may breach Skool's terms of service or a group's own rules. Want me to flag this prominently or just execute?
2. **Pick count** — confirmed 2-4? I lean 3 (Evolve mandatory + 1 premium AI play + 1 hidden-niche).
3. **What we scrape inside** — modules-only (free, just titles), or full lesson content (slower, possibly TOS-breaching)?
4. **Storage** — scraped community content → local only, or also pushed to dashboard?
