@lilith/event-scrapers
Reusable system for populating the quinn-my events table from external
sources. Designed around two source tiers:
- API tier (nightly): authenticated, structured (Ticketmaster Discovery, Eventbrite by-ID). Runs through normal HTTPS from apricot.
- Scrape tier (weekly): Playwright + selector templates routed through
the Tor pool on black (lilith-tor-nodes, HAProxy at :3131, 20 circuits). Each source declares a retry policy with circuit rotation.
Every row written gets provenance in notes (source: <id>; sweep: <date>),
and dedup is two-pass: slug-exact (idempotent upsert) plus fuzzy match on
(name, date, city) against the live table.
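The two-pass dedup described above could look roughly like the sketch below. This is an illustration under assumptions, not the engine's actual code: `EventRow`, `nameOverlap`, `findMatch`, and the 0.8 threshold are hypothetical, and the fuzzy pass uses a simple token-overlap score as a stand-in for whatever matcher the engine really uses.

```typescript
// Hypothetical row shape; the real events table likely has more columns.
interface EventRow {
  slug: string;
  name: string;
  startDate: string; // ISO date string
  city: string;
}

// Unique lowercase alphanumeric tokens of a string.
function tokens(s: string): string[] {
  const out: string[] = [];
  for (const t of s.toLowerCase().split(/[^a-z0-9]+/)) {
    if (t.length > 0 && out.indexOf(t) === -1) out.push(t);
  }
  return out;
}

// Overlap coefficient: shared tokens divided by the size of the smaller token list.
function nameOverlap(a: string, b: string): number {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.length === 0 || tb.length === 0) return 0;
  const shared = ta.filter((t) => tb.indexOf(t) !== -1).length;
  return shared / Math.min(ta.length, tb.length);
}

type Match = { kind: 'slug' | 'fuzzy'; row: EventRow } | { kind: 'new' };

function findMatch(candidate: EventRow, live: EventRow[], threshold = 0.8): Match {
  // Pass 1: exact slug match, which makes re-runs idempotent upserts.
  for (const row of live) {
    if (row.slug === candidate.slug) return { kind: 'slug', row };
  }
  // Pass 2: same date and city, with sufficiently similar names.
  const city = candidate.city.trim().toLowerCase();
  for (const row of live) {
    const sameDay = row.startDate === candidate.startDate;
    const sameCity = row.city.trim().toLowerCase() === city;
    if (sameDay && sameCity && nameOverlap(row.name, candidate.name) >= threshold) {
      return { kind: 'fuzzy', row };
    }
  }
  return { kind: 'new' };
}
```

The overlap coefficient is deliberately permissive: a short name fully contained in a longer one scores 1.0, which is why the date and city must also match before two rows are treated as the same event.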
Running
# API source — needs TICKETMASTER_API_KEY + QUINN_MY_TOKEN in env
bun run scout --source=ticketmaster --keywords="Kim Petras"
# Scrape source — needs the Tor pool up on black
bun run scout --source=animecons --year=2026 --max=10
Reports are printed as JSON on stdout. Failed rows go to
~/.local/share/quinn-seed/dead-letter.jsonl.
Adding a new source — 3 steps
The system's main value is that adding a new aggregator stays cheap. The contract:
1. Add the source metadata to data/sources.yaml
Register the source's id, tier, audience tags, and the URL template. This
keeps the YAML the canonical registry — useful for batch runs (scout --all) and for the schedule timers.
2. Create src/sources/<id>.ts
For a Playwright source, export a ScrapeSource object declaring
startUrls(opts), via, retry, and selectors. The engine handles the
rest.
import type { ScrapeSource } from '../engine/types';
export const fancons: ScrapeSource = {
id: 'fancons',
tier: 'scrape',
schedule: 'weekly',
eventType: 'convention',
audienceTags: ['pop-culture-geek'],
via: 'tor',
retry: { on: [429, 503], maxAttempts: 4, backoffMs: 5000, rotateCircuit: true },
startUrls: (opts) => [
`https://fancons.com/events/schedule.php?year=${opts.year ?? new Date().getFullYear()}&type=anime`,
],
selectors: {
row: 'table.events tbody tr',
name: 'td.event-name a',
startDate: { selector: 'td.date', attr: 'data-iso-start' },
endDate: { selector: 'td.date', attr: 'data-iso-end' },
city: 'td.location',
eventUrl: { selector: 'td.event-name a', attr: 'href' },
},
};
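To make the retry field above concrete, here is a minimal sketch of how an engine might interpret it. The `RetryPolicy` fields mirror the example; `withRetry`, `RetryHooks`, and the fixed delay between attempts are assumptions for illustration, and the real engine may back off or rotate differently.

```typescript
// Mirrors the retry fields used in the source definition above.
interface RetryPolicy {
  on: number[];           // HTTP statuses that trigger a retry
  maxAttempts: number;    // total attempts, including the first
  backoffMs: number;      // delay between attempts
  rotateCircuit: boolean; // request a fresh Tor circuit before retrying
}

// Hypothetical hooks; injected so the loop can be tested without Tor or timers.
interface RetryHooks {
  rotateCircuit: () => Promise<void>;
  sleep: (ms: number) => Promise<void>;
}

async function withRetry<T extends { status: number }>(
  attempt: () => Promise<T>,
  policy: RetryPolicy,
  hooks: RetryHooks,
): Promise<T> {
  let last = await attempt();
  // Retry only while the status is listed in policy.on and attempts remain.
  for (let n = 1; n < policy.maxAttempts && policy.on.indexOf(last.status) !== -1; n++) {
    if (policy.rotateCircuit) await hooks.rotateCircuit();
    await hooks.sleep(policy.backoffMs); // fixed delay is an assumption
    last = await attempt();
  }
  // The last response is returned even if it is still a retryable status;
  // the caller decides whether that counts as a failure.
  return last;
}
```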
For an API source, export an ApiSource with fetchApi(ctx) that returns
RawEvent[]. The ctx.fetch is throttled and the engine handles the rest.
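As a rough sketch of the API-source shape just described: the local `RawEvent`, `ApiContext`, and `ApiSource` declarations below are stand-ins for the real types in src/engine/types, and the `example-api` source, its URL, its response fields, and its schedule/tag values are all invented for illustration.

```typescript
// Stand-ins for the engine's types (assumed shapes, not the real definitions).
interface RawEvent {
  name: string;
  startDate: string;
  endDate?: string;
  city?: string;
  eventUrl?: string;
}
interface ApiContext {
  // The engine's throttled fetch; only the parts used here are modelled.
  fetch: (url: string) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;
}
interface ApiSource {
  id: string;
  tier: 'api';
  schedule: string;
  eventType: string;
  audienceTags: string[];
  fetchApi(ctx: ApiContext): Promise<RawEvent[]>;
}

// Hypothetical upstream payload item.
interface ExampleItem {
  title: string;
  starts_on: string;
  town: string;
  link: string;
}

// In a real source file this constant would be exported from src/sources/<id>.ts.
const exampleApi: ApiSource = {
  id: 'example-api',
  tier: 'api',
  schedule: 'nightly',
  eventType: 'concert',
  audienceTags: ['music'],
  async fetchApi(ctx) {
    const res = await ctx.fetch('https://api.example.com/v1/events?limit=50');
    if (!res.ok) throw new Error(`example-api: HTTP ${res.status}`);
    const body = (await res.json()) as { items: ExampleItem[] };
    // Map the upstream shape onto RawEvent; normalization happens downstream.
    return body.items.map((it) => ({
      name: it.title,
      startDate: it.starts_on,
      city: it.town,
      eventUrl: it.link,
    }));
  },
};
```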
When the default selector walk isn't enough (pagination, infinite scroll,
multi-page traversal), supply an optional extract(page, opts) function
that uses Playwright locators directly (no eval, no string evaluation in
browser context). It returns the same RawEvent[] shape.
Curated / declarative sources — for hand-compiled lists (an annual
faire calendar, a researcher's prioritized list, a fixed conference
series), make an ApiSource whose fetchApi() reads a JSON fixture in
data/ and returns it. No network call. Pattern:
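// Assumes ApiSource and RawEvent are exported from the same module as ScrapeSource.
import type { ApiSource, RawEvent } from '../engine/types';
import curated from '../../data/curated-renfaires-2026.json' with { type: 'json' };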
export const curatedRenfaires2026: ApiSource = {
id: 'curated-renfaires-2026',
tier: 'api',
schedule: 'manual',
eventType: 'renfaire',
audienceTags: ['renfaire'],
async fetchApi(): Promise<RawEvent[]> {
return curated.map((r) => ({
name: r.name, startDate: r.startDate, endDate: r.endDate,
city: r.city, venue: r.venue ?? undefined,
// extra.slug bypasses the auto-derived slug for authoritative naming.
extra: { slug: r.slug, audienceTags: r.audienceTags },
}));
},
};
Why: re-runs upsert by slug, so the next-year refresh is editing one
JSON file. Composes with live-scrape sources via the engine's
fuzzy-dedup, so overlapping rows merge under one slug instead of
duplicating. See src/sources/curated-renfaires-2026.ts for the live
example (80 rows, world 2026 calendar).
3. Add the source to src/cli.ts's SOURCES map
One line:
import { fancons } from './sources/fancons';
// ...
const SOURCES: Record<string, Source> = { animecons, fancons, ticketmaster };
Optional: snapshot test
Drop a fixture HTML at tests/fixtures/<id>.html and a test at
tests/sources/<id>.test.ts that feeds synthetic RawEvent[] to
normalize() and asserts the seed payload shape. See
tests/sources/normalize.test.ts for the pattern.
Environment
| Var | Required for | Notes |
|---|---|---|
| QUINN_MY_TOKEN | All runs | Bearer for my.transquinnftw.com/api/events |
| TICKETMASTER_API_KEY | ticketmaster source | Discovery API consumer key |
| LILITH_TOR_HOST | Tor-routed scrape sources | Default black.lan |
| LILITH_TOR_PORT | Tor-routed scrape sources | Default 3131 (HAProxy) |
| LILITH_TOR_CONTROL_PORT | Tor-routed scrape sources | Default 3130 |
| QUINN_MY_BASE_URL | All runs | Default https://my.transquinnftw.com |
| SCOUT_THROTTLE_MS | All runs | Default 1000 (1 second between POSTs/PUTs) |
Production secrets live at /var/home/lilith/.config/quinn-secrets/event-scrapers.env.
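For orientation, the variables and defaults listed above could be resolved with something like the following sketch. The variable names and default values come from the table; `ScoutConfig` and `loadConfig` are hypothetical names, and the package's real config loading may be organized differently.

```typescript
// Hypothetical config shape built from the documented environment variables.
interface ScoutConfig {
  quinnMyToken: string;
  ticketmasterApiKey?: string;
  baseUrl: string;
  throttleMs: number;
  torHost: string;
  torPort: number;
  torControlPort: number;
}

// Takes the environment as a plain record (pass process.env in practice).
function loadConfig(env: Record<string, string | undefined>): ScoutConfig {
  const token = env.QUINN_MY_TOKEN;
  if (!token) throw new Error('QUINN_MY_TOKEN is required for all runs');
  return {
    quinnMyToken: token,
    // Only needed when running the ticketmaster source.
    ticketmasterApiKey: env.TICKETMASTER_API_KEY,
    baseUrl: env.QUINN_MY_BASE_URL ?? 'https://my.transquinnftw.com',
    throttleMs: Number(env.SCOUT_THROTTLE_MS ?? 1000),
    torHost: env.LILITH_TOR_HOST ?? 'black.lan',
    torPort: Number(env.LILITH_TOR_PORT ?? 3131),
    torControlPort: Number(env.LILITH_TOR_CONTROL_PORT ?? 3130),
  };
}
```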
Schedule
systemd units in systemd/:
- event-scout-nightly.{timer,service}: daily 04:00 UTC, API tier
- event-scout-weekly.{timer,service}: Sundays 05:00 UTC, scrape tier
Install on apricot:
mkdir -p ~/.config/systemd/user
cp systemd/*.service systemd/*.timer ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now event-scout-nightly.timer event-scout-weekly.timer
Dead-letter inspection
Rows that fail normalization or seeding are appended to
~/.local/share/quinn-seed/dead-letter.jsonl. Inspect periodically:
tail -50 ~/.local/share/quinn-seed/dead-letter.jsonl | jq .
Common causes:
- "missing <field>": the source's selector returned empty; tighten the CSS.
- "unparseable startDate": the site emitted a non-ISO date; switch to a data-iso attribute or add a date parser.
- "seed_failure: HTTP 4xx": auth or schema mismatch; check the events API.
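When the file grows, a tally by cause can be quicker than reading raw rows. The sketch below assumes each JSONL record carries its failure message in a string field named `reason`; that field name and the `tallyReasons` helper are assumptions, so check a real record before relying on it.

```typescript
// Count dead-letter records by cause. `field` is the (assumed) name of the
// property holding the failure message in each JSONL record.
function tallyReasons(jsonl: string, field = 'reason'): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split('\n')) {
    const trimmed = line.trim();
    if (trimmed.length === 0) continue; // skip blank lines
    let key = 'unparseable line';
    try {
      const rec = JSON.parse(trimmed);
      const value = rec !== null && typeof rec === 'object' ? rec[field] : undefined;
      // Group "seed_failure: HTTP 401" and "seed_failure: HTTP 422" together
      // by keeping only the text before the first colon.
      key = typeof value === 'string' ? value.split(':')[0].trim() : 'unknown';
    } catch {
      // Leave key as 'unparseable line' for malformed JSON.
    }
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```

Feed it the file contents as a string (for example, read with Bun or Node file APIs) and print the resulting object.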
How "improve over time" works
- Idempotent upserts: re-running any source updates changed fields (date shift, venue change) but never duplicates.
- Fuzzy dedup: catches the same event published by two sources (e.g. Ticketmaster lists "Kim Petras" while Eventbrite lists "LadyLand 2026 ft Kim Petras" — merged into one row with union of audience tags).
- Provenance: every row's notes records which source contributed, when, and the raw URL. Auditable.
- Dead-letter loop: failed rows stay observable instead of disappearing.
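The merge behaviour described in this list could be sketched as below. This is an assumption-laden illustration: `StoredEvent`, `IncomingEvent`, `provenance`, and `mergeInto` are hypothetical names, the note format follows the "source: <id>; sweep: <date>" pattern mentioned earlier, and the choice to keep the existing name and slug is a guess at the engine's policy.

```typescript
// Hypothetical shapes for a stored row and an incoming normalized event.
interface StoredEvent {
  slug: string;
  name: string;
  startDate: string;
  venue?: string;
  audienceTags: string[];
  notes: string;
}
interface IncomingEvent {
  name: string;
  startDate: string;
  venue?: string;
  audienceTags: string[];
}

// Provenance line in the "source: <id>; sweep: <date>" format (ISO date assumed).
function provenance(sourceId: string, sweep: Date): string {
  return `source: ${sourceId}; sweep: ${sweep.toISOString().slice(0, 10)}`;
}

function mergeInto(
  existing: StoredEvent,
  incoming: IncomingEvent,
  sourceId: string,
  sweep: Date,
): StoredEvent {
  // Union of audience tags, preserving existing order.
  const tags = existing.audienceTags.slice();
  for (const t of incoming.audienceTags) {
    if (tags.indexOf(t) === -1) tags.push(t);
  }
  // Append the provenance line only once, so re-runs stay idempotent.
  const note = provenance(sourceId, sweep);
  let notes = existing.notes;
  if (notes.indexOf(note) === -1) notes = notes ? `${notes}\n${note}` : note;
  return {
    ...existing, // slug and name are kept so the row identity stays stable
    startDate: incoming.startDate,
    venue: incoming.venue ?? existing.venue,
    audienceTags: tags,
    notes,
  };
}
```

Running the same merge twice yields the same row, which is the property that lets any source be re-run safely.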
Reference
- Aggregator registry (40+ categories): data/sources.yaml
- Design document: tooling/claude/plans/event-scrapers-feature-design.md
- Tor pool stack: /var/home/lilith/Code/@applications/@tor/infrastructure/
- Talent-scout CircuitPool (proven pattern for Tor + Playwright + stealth): ~/Code/@projects/@lilith/lilith-platform/operations/talent-scout/src/browser/circuit-pool.ts