From 2e5a07d4c7f8c53a6ad4b32e598cf5986f3045ca Mon Sep 17 00:00:00 2001
From: Claude
Date: Wed, 11 Feb 2026 10:14:32 +0000
Subject: [PATCH] Add retrieve pipeline documentation

Companion to docs/transform.md, documenting how raw data is fetched from
cinema websites and APIs. Covers retrieval approaches (HTML scraping,
REST APIs, OCAPI, GraphQL, Playwright automation, Gatsby extraction,
signed APIs), shared platforms (17 common modules serving 112 venues),
sources (9 external ticketing platforms), common utilities, and return
data structures.

https://claude.ai/code/session_01W5jb9PEZjuL4xzLhHRdfL5
---
 docs/retrieve.md | 561 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 561 insertions(+)
 create mode 100644 docs/retrieve.md

diff --git a/docs/retrieve.md b/docs/retrieve.md
new file mode 100644
index 00000000..ff89911d
--- /dev/null
+++ b/docs/retrieve.md
@@ -0,0 +1,561 @@
+# Retrieve Pipeline
+
+The retrieve pipeline is responsible for fetching raw cinema listing data from
+websites and APIs. It produces the unprocessed data that the
+[transform pipeline](./transform.md) later converts into a standardised format.
+
+## Architecture
+
+The retrieve step supports two kinds of modules:
+
+- **Cinemas** -- individual venues (e.g. `odeon.co.uk-leicester-square`). Each has
+  its own `retrieve.js`, `transform.js`, and `attributes.js`.
+- **Sources** -- external ticketing platforms (e.g. Eventbrite, Dice.fm). These
+  supply supplementary event data that cinemas can incorporate during transform.
+
+### Dispatch Flow
+
+```mermaid
+flowchart TD
+    A[scripts/retrieve/index.js] --> B{Cinema or Source?}
+    B -->|Cinema| C[cinemas/index.js]
+    B -->|Source| D[sources/index.js]
+    C --> E["Load cinema module"]
+    D --> F["Load source module"]
+    E --> G["Call retrieve()"]
+    F --> G
+    G --> H[Return Raw Data]
+    H --> I[Transform Pipeline]
+```
+
+The entry point (`scripts/retrieve/index.js`) resolves the location name to either
+a cinema or source module, then calls its `retrieve()` function. The returned data
+is opaque to the orchestrator -- each module defines its own structure.
+
+### The Delegation Pattern
+
+Most cinemas don't implement retrieval from scratch. Instead, they delegate to a
+shared platform module in `common/`, passing venue-specific attributes:
+
+```
+cinemas/odeon.co.uk-leicester-square/retrieve.js
+  → common/odeon.co.uk/retrieve.js
+    → common/ocapi-v1/retrieve.js
+```
+
+A typical cinema `retrieve.js` looks like this:
+
+```js
+// cinemas/odeon.co.uk-leicester-square/retrieve.js
+const attributes = require("./attributes");
+const odeonRetrieve = require("../../common/odeon.co.uk/retrieve");
+
+async function retrieve() {
+  return odeonRetrieve(attributes);
+}
+```
+
+The `attributes.js` file provides venue-specific configuration:
+
+```js
+// cinemas/odeon.co.uk-leicester-square/attributes.js
+module.exports = {
+  id: "odeon.co.uk-leicester-square",
+  name: "ODEON Luxe Leicester Square",
+  domain: "https://www.odeon.co.uk",
+  url: "https://www.odeon.co.uk/cinemas/london-leicester-square",
+  cinemaId: "153",
+  // ... address, geo, socials, etc.
+};
+```
+
+Of the 145 cinemas with retrieve implementations, 112 delegate to one of 17 shared
+platforms in `common/`. The remaining ~33 have standalone implementations.
+
+---
+
+## Retrieval Approaches
+
+### Direct JSON Fetch
+
+The simplest approach: a single `fetchJson()` call returns all needed data.
+
+**Platforms:** Electric Cinema (2 venues)
+
+```js
+// common/electriccinema.co.uk/retrieve.js
+async function retrieve({ domain }) {
+  const site = await fetchJson(`${domain}/data/data.json`);
+  return site;
+}
+```
+
+### Single HTML Page
+
+Fetches a single HTML page that contains all listing data. No detail page requests
+are needed because the listing page has enough information for the transform step.
+
+**Platforms:** Firmdale Hotels (3 venues)
+**Sources:** Stow Film Lounge
+
+```js
+// common/firmdalehotels.com/retrieve.js
+async function retrieve({ url }) {
+  const movieListPage = await fetchText(url);
+  return { movieListPage };
+}
+```
+
+### HTML Scraping (List + Detail Pages)
+
+The most common pattern for standalone cinemas. Fetches a listing page, parses it
+with Cheerio to extract links, then fetches each detail page individually.
+
+**Platforms:** Tate (2 venues), Olympic Studios (3 venues), The Castle Cinema (2
+venues), Admit One (2 venues)
+**Standalone:** ~33 cinemas use this pattern with venue-specific selectors
+**Sources:** OutSavvy, Wimbledon Film Club
+
+```js
+// cinemas/ica.art/retrieve.js
+async function retrieve() {
+  const movieListPage = await fetchText(url);
+  const $ = cheerio.load(movieListPage);
+
+  const moviePageUrls = new Set();
+  $(".item.films").each(function () {
+    const url = $(this).children("a").attr("href");
+    moviePageUrls.add(`${domain}${url}`);
+  });
+
+  const moviePages = {};
+  for (const moviePageUrl of [...moviePageUrls]) {
+    moviePages[moviePageUrl] = await fetchText(moviePageUrl);
+  }
+
+  return { movieListPage, moviePages };
+}
+```
+
+Each venue uses different CSS selectors (`.item.films`, `.card-list .card a`,
+`.programme-tile`, `.whatson_panel`, etc.) but the fetch-parse-fetch structure is
+the same.
+
+**Variant:** Admit One uses `fetchWin1252Text()` instead of `fetchText()` to handle
+legacy Windows-1252 encoded pages.
+
+### Embedded JSON Extraction
+
+Fetches an HTML page and extracts structured data from embedded `