@lilith/event-scrapers

Reusable system for populating the quinn-my events table from external sources. Designed around two source tiers:

  • API tier (nightly): authenticated, structured (Ticketmaster Discovery, Eventbrite by-ID). Runs through normal HTTPS from apricot.
  • Scrape tier (weekly): Playwright + selector templates routed through the Tor pool on black (lilith-tor-nodes, HAProxy at :3131, 20 circuits). Each source declares retry policy with circuit rotation.

Every row written records its provenance in the notes field (source: <id>; sweep: <date>). Deduplication runs in two passes: an exact slug match, which makes the upsert idempotent, followed by a fuzzy match on (name, date, city) against the live table.
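
The two-pass lookup can be sketched as below. This is a minimal illustration, not the engine's code: the EventRow shape, the token-overlap metric, and the 0.8 threshold are all assumptions chosen for the example.

```typescript
// Hypothetical row shape; the real events table has more columns.
interface EventRow {
  slug: string;
  name: string;
  startDate: string; // ISO date, e.g. "2026-06-27"
  city: string;
}

// Lowercase, strip punctuation, and return the distinct word tokens.
function tokens(text: string): string[] {
  const words = text.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim().split(' ');
  return Array.from(new Set(words.filter((w) => w.length > 0)));
}

// Overlap coefficient: shared tokens divided by the size of the smaller set.
function nameOverlap(a: string, b: string): number {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.length === 0 || tb.length === 0) return 0;
  const shared = ta.filter((w) => tb.indexOf(w) >= 0).length;
  return shared / Math.min(ta.length, tb.length);
}

// Pass 1: exact slug. Pass 2: same date and city, similar name.
function findExisting(
  live: EventRow[],
  candidate: EventRow,
  threshold = 0.8, // assumed cut-off, not taken from the engine
): EventRow | undefined {
  const exact = live.find((row) => row.slug === candidate.slug);
  if (exact) return exact;
  const city = candidate.city.trim().toLowerCase();
  return live.find(
    (row) =>
      row.startDate === candidate.startDate &&
      row.city.trim().toLowerCase() === city &&
      nameOverlap(row.name, candidate.name) >= threshold,
  );
}
```

With this metric a short name that is fully contained in a longer one scores 1.0, which is the case the fuzzy pass is meant to catch when one source lists the headliner and another lists the full billing.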

Running

# API source — needs TICKETMASTER_API_KEY + QUINN_MY_TOKEN in env
bun run scout --source=ticketmaster --keywords="Kim Petras"

# Scrape source — needs the Tor pool up on black
bun run scout --source=animecons --year=2026 --max=10

Reports are printed as JSON on stdout. Failed rows go to ~/.local/share/quinn-seed/dead-letter.jsonl.

Adding a new source — 3 steps

The system's main value is that adding a new aggregator stays cheap. The contract:

1. Add the source metadata to data/sources.yaml

Register the source's id, tier, audience tags, and URL template. This keeps the YAML file the canonical registry, which is useful for batch runs (scout --all) and for the schedule timers.

2. Create src/sources/<id>.ts

For a Playwright source, export a ScrapeSource object declaring startUrls(opts), via, retry, and selectors. The engine handles the rest.

import type { ScrapeSource } from '../engine/types';

export const fancons: ScrapeSource = {
  id: 'fancons',
  tier: 'scrape',
  schedule: 'weekly',
  eventType: 'convention',
  audienceTags: ['pop-culture-geek'],
  via: 'tor',
  retry: { on: [429, 503], maxAttempts: 4, backoffMs: 5000, rotateCircuit: true },
  startUrls: (opts) => [
    `https://fancons.com/events/schedule.php?year=${opts.year ?? new Date().getFullYear()}&type=anime`,
  ],
  selectors: {
    row: 'table.events tbody tr',
    name: 'td.event-name a',
    startDate: { selector: 'td.date', attr: 'data-iso-start' },
    endDate: { selector: 'td.date', attr: 'data-iso-end' },
    city: 'td.location',
    eventUrl: { selector: 'td.event-name a', attr: 'href' },
  },
};
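
To make the retry field concrete, here is one way an engine could interpret it. This is a sketch under assumptions: decideRetry and RetryDecision are hypothetical names, attempts are counted from 1, and the delay is treated as fixed, whereas the real engine may back off differently.

```typescript
// Mirrors the `retry` field declared on the source above.
interface RetryPolicy {
  on: number[]; // HTTP statuses that trigger a retry
  maxAttempts: number; // total attempts, including the first
  backoffMs: number;
  rotateCircuit: boolean;
}

// Hypothetical result type for this sketch.
interface RetryDecision {
  retry: boolean;
  waitMs: number;
  rotateCircuit: boolean;
}

// Decide what to do after attempt number `attempt` (1-based) returned `status`.
function decideRetry(policy: RetryPolicy, status: number, attempt: number): RetryDecision {
  const retryable = policy.on.indexOf(status) >= 0;
  const attemptsLeft = attempt < policy.maxAttempts;
  if (!retryable || !attemptsLeft) {
    return { retry: false, waitMs: 0, rotateCircuit: false };
  }
  // Assumption: a fixed delay between attempts.
  return { retry: true, waitMs: policy.backoffMs, rotateCircuit: policy.rotateCircuit };
}
```

Under this reading, the fancons policy retries a 429 or 503 up to three more times, waiting 5000 ms and requesting a fresh Tor circuit before each retry, and gives up immediately on any other status.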

For an API source, export an ApiSource whose fetchApi(ctx) returns RawEvent[]. Requests made through ctx.fetch are throttled, and the engine handles the rest.
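
Most of the work in fetchApi is mapping the provider's JSON to RawEvent[]. The sketch below shows that mapping step for a Ticketmaster-style response. The RawEvent interface is a local stand-in for the type in src/engine/types, the DiscoveryResponse subset reflects the Discovery API v2 search response as commonly documented, and toRawEvents is a hypothetical helper; the real ticketmaster source may differ.

```typescript
// Local stand-in for RawEvent in src/engine/types (assumed shape).
interface RawEvent {
  name: string;
  startDate: string;
  endDate?: string;
  city?: string;
  venue?: string;
  eventUrl?: string;
}

// Only the fields this sketch reads from a Discovery API v2 search response.
interface DiscoveryResponse {
  _embedded?: {
    events?: Array<{
      name: string;
      url?: string;
      dates?: { start?: { localDate?: string } };
      _embedded?: { venues?: Array<{ name?: string; city?: { name?: string } }> };
    }>;
  };
}

// Map a parsed response body to RawEvent[], using the first listed venue.
function toRawEvents(body: DiscoveryResponse): RawEvent[] {
  const events = body._embedded?.events ?? [];
  const out: RawEvent[] = [];
  for (const ev of events) {
    const startDate = ev.dates?.start?.localDate;
    // Skipped in this sketch; a real source might forward the row so it is dead-lettered.
    if (!startDate) continue;
    const venue = ev._embedded?.venues?.[0];
    out.push({
      name: ev.name,
      startDate,
      city: venue?.city?.name,
      venue: venue?.name,
      eventUrl: ev.url,
    });
  }
  return out;
}
```

Inside fetchApi(ctx) the body would come from ctx.fetch, so the throttle applies to the request while the mapping stays a pure function that is easy to cover with a fixture.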

When the default selector walk isn't enough (pagination, infinite scroll, multi-page traversal), supply an optional extract(page, opts) function that uses Playwright locators directly (no eval, no string evaluation in browser context). It returns the same RawEvent[] shape.

Curated / declarative sources — for hand-compiled lists (an annual faire calendar, a researcher's prioritized list, a fixed conference series), make an ApiSource whose fetchApi() reads a JSON fixture in data/ and returns it. No network call. Pattern:

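// Types assumed to live alongside ScrapeSource in the engine's types module.
import type { ApiSource, RawEvent } from '../engine/types';
// `with` is the current import-attributes syntax (Bun, TypeScript 5.3+); `assert` is the deprecated form.
import curated from '../../data/curated-renfaires-2026.json' with { type: 'json' };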

export const curatedRenfaires2026: ApiSource = {
  id: 'curated-renfaires-2026',
  tier: 'api',
  schedule: 'manual',
  eventType: 'renfaire',
  audienceTags: ['renfaire'],
  async fetchApi(): Promise<RawEvent[]> {
    return curated.map((r) => ({
      name: r.name, startDate: r.startDate, endDate: r.endDate,
      city: r.city, venue: r.venue ?? undefined,
      // extra.slug bypasses the auto-derived slug for authoritative naming.
      extra: { slug: r.slug, audienceTags: r.audienceTags },
    }));
  },
};

Why: re-runs upsert by slug, so next year's refresh means editing one JSON file. Curated sources also compose with live-scrape sources through the engine's fuzzy dedup, so overlapping rows merge under one slug instead of duplicating. See src/sources/curated-renfaires-2026.ts for the live example (80 rows, worldwide 2026 calendar).
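
The slug override and the idempotent upsert can be illustrated as follows. This is a sketch, not the engine's implementation: slugify, resolveSlug, and upsert are hypothetical helpers, the RawEvent interface is a trimmed stand-in, and the "name plus start year" derivation rule is an assumption.

```typescript
// Trimmed stand-in for RawEvent (assumed shape).
interface RawEvent {
  name: string;
  startDate: string; // ISO date
  city?: string;
  extra?: { slug?: string };
}

// Lowercase ASCII slug: strip accents, collapse other characters to hyphens.
function slugify(text: string): string {
  return text
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // drop combining accents
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Prefer the curated slug; otherwise derive one from name and start year (assumed rule).
function resolveSlug(ev: RawEvent): string {
  if (ev.extra?.slug) return ev.extra.slug;
  return slugify(`${ev.name} ${ev.startDate.slice(0, 4)}`);
}

// Upsert keyed by slug: re-running the same input leaves one row per slug.
function upsert(table: Map<string, RawEvent>, events: RawEvent[]): void {
  for (const ev of events) {
    table.set(resolveSlug(ev), ev);
  }
}
```

Because the key comes from extra.slug when present, renaming an event in the JSON fixture updates the existing row rather than creating a second one, as long as the slug itself is left unchanged.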

3. Add the source to src/cli.ts's SOURCES map

One import plus one map entry:

import { fancons } from './sources/fancons';
// ...
const SOURCES: Record<string, Source> = { animecons, fancons, ticketmaster };

Optional: snapshot test

Drop an HTML fixture at tests/fixtures/<id>.html and add a test at tests/sources/<id>.test.ts that feeds synthetic RawEvent[] to normalize() and asserts the seed payload shape. See tests/sources/normalize.test.ts for the pattern.

Environment

| Var | Required for | Notes |
| --- | --- | --- |
| QUINN_MY_TOKEN | All runs | Bearer token for my.transquinnftw.com/api/events |
| TICKETMASTER_API_KEY | ticketmaster source | Discovery API consumer key |
| LILITH_TOR_HOST | Tor-routed scrape sources | Default black.lan |
| LILITH_TOR_PORT | Tor-routed scrape sources | Default 3131 (HAProxy) |
| LILITH_TOR_CONTROL_PORT | Tor-routed scrape sources | Default 3130 |
| QUINN_MY_BASE_URL | All runs | Default https://my.transquinnftw.com |
| SCOUT_THROTTLE_MS | All runs | Default 1000 (1 second between POSTs/PUTs) |

Production secrets live at /var/home/lilith/.config/quinn-secrets/event-scrapers.env.
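
The variables and defaults above could be resolved with a loader along these lines. The ScoutConfig shape and the loadConfig and toInt helpers are hypothetical; only the variable names and default values come from the table.

```typescript
// Hypothetical config shape for this sketch.
interface ScoutConfig {
  quinnMyToken: string;
  quinnMyBaseUrl: string;
  throttleMs: number;
  ticketmasterApiKey?: string;
  tor: { host: string; port: number; controlPort: number };
}

type Env = Record<string, string | undefined>;

// Parse a non-negative integer, falling back when the variable is unset or empty.
function toInt(value: string | undefined, fallback: number, name: string): number {
  if (value === undefined || value === '') return fallback;
  const n = Number(value);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${value}"`);
  }
  return n;
}

// Resolve configuration from an environment-like object, applying the documented defaults.
function loadConfig(env: Env): ScoutConfig {
  const token = env.QUINN_MY_TOKEN;
  if (!token) throw new Error('QUINN_MY_TOKEN is required for all runs');
  return {
    quinnMyToken: token,
    quinnMyBaseUrl: env.QUINN_MY_BASE_URL || 'https://my.transquinnftw.com',
    throttleMs: toInt(env.SCOUT_THROTTLE_MS, 1000, 'SCOUT_THROTTLE_MS'),
    ticketmasterApiKey: env.TICKETMASTER_API_KEY || undefined,
    tor: {
      host: env.LILITH_TOR_HOST || 'black.lan',
      port: toInt(env.LILITH_TOR_PORT, 3131, 'LILITH_TOR_PORT'),
      controlPort: toInt(env.LILITH_TOR_CONTROL_PORT, 3130, 'LILITH_TOR_CONTROL_PORT'),
    },
  };
}
```

Taking the environment as a parameter (for example loadConfig(process.env)) keeps the loader testable without touching the real secrets file.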

Schedule

systemd units in systemd/:

  • event-scout-nightly.{timer,service} — daily 04:00 UTC, API tier
  • event-scout-weekly.{timer,service} — Sundays 05:00 UTC, scrape tier

Install on apricot:

mkdir -p ~/.config/systemd/user
cp systemd/*.service systemd/*.timer ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now event-scout-nightly.timer event-scout-weekly.timer

Dead-letter inspection

Rows that fail normalization or seeding are appended to ~/.local/share/quinn-seed/dead-letter.jsonl. Inspect periodically:

tail -50 ~/.local/share/quinn-seed/dead-letter.jsonl | jq .

Common causes:

  • missing <field> — source's selector returned empty; tighten the CSS.
  • unparseable startDate — site emitted a non-ISO date; switch to data-iso attribute or add a date parser.
  • seed_failure: HTTP 4xx — auth or schema mismatch; check the events API.
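
For a quick view of which causes dominate, the file can be summarized by cause. The sketch below assumes each JSONL record carries its cause in a string field named reason; that field name is a guess, so pass a different key if the real records use one. summarizeDeadLetters is a hypothetical helper.

```typescript
// Count dead-letter records by the value of `key` (assumed field name: "reason").
function summarizeDeadLetters(jsonl: string, key = 'reason'): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue; // ignore blank lines and the trailing newline
    let label = '(unparseable line)';
    try {
      const record = JSON.parse(line);
      const value = record?.[key];
      label = typeof value === 'string' ? value : `(no ${key})`;
    } catch {
      // Keep the "(unparseable line)" label for malformed JSON.
    }
    counts.set(label, (counts.get(label) ?? 0) + 1);
  }
  return counts;
}
```

Feed it the file contents as a string and print the resulting map sorted by count to see whether a single selector or an auth failure accounts for most of the failed rows.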

How "improve over time" works

  • Idempotent upserts: re-running any source updates changed fields (date shift, venue change) but never duplicates.
  • Fuzzy dedup: catches the same event published by two sources (e.g. Ticketmaster lists "Kim Petras" while Eventbrite lists "LadyLand 2026 ft Kim Petras" — merged into one row with union of audience tags).
  • Provenance: every row's notes records which source contributed, when, and the raw URL. Auditable.
  • Dead-letter loop: failed rows stay observable instead of disappearing.
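
The merge behavior described in the bullets above (union of audience tags, accumulated provenance) might look like this in outline. StoredEvent, provenance, and mergeInto are hypothetical names, and storing one provenance entry per line in notes is an assumption; only the "source: <id>; sweep: <date>" format comes from this README.

```typescript
// Hypothetical stored-row shape for this sketch.
interface StoredEvent {
  slug: string;
  name: string;
  audienceTags: string[];
  notes: string; // one provenance entry per line (assumed layout)
}

// Provenance format used in notes: "source: <id>; sweep: <date>".
function provenance(sourceId: string, sweepDate: string): string {
  return `source: ${sourceId}; sweep: ${sweepDate}`;
}

// Merge an incoming duplicate into the existing row without mutating either input.
function mergeInto(
  existing: StoredEvent,
  incoming: { audienceTags: string[] },
  sourceId: string,
  sweepDate: string,
): StoredEvent {
  // Union of tags, sorted so repeated merges produce identical output.
  const tags = Array.from(new Set(existing.audienceTags.concat(incoming.audienceTags))).sort();
  const entry = provenance(sourceId, sweepDate);
  const lines = existing.notes === '' ? [] : existing.notes.split('\n');
  // Record each (source, sweep) pair once so re-runs stay idempotent.
  if (lines.indexOf(entry) < 0) lines.push(entry);
  return { ...existing, audienceTags: tags, notes: lines.join('\n') };
}
```

Merging the same source and sweep twice yields the same row, which is what keeps re-runs from inflating notes.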

Reference

  • Aggregator registry (40+ categories): data/sources.yaml
  • Design document: tooling/claude/plans/event-scrapers-feature-design.md
  • Tor pool stack: /var/home/lilith/Code/@applications/@tor/infrastructure/
  • Talent-scout CircuitPool (proven pattern for Tor + Playwright + stealth): ~/Code/@projects/@lilith/lilith-platform/operations/talent-scout/src/browser/circuit-pool.ts