diff --git a/docs/retrieve.md b/docs/retrieve.md
new file mode 100644
index 00000000..ff89911d
--- /dev/null
+++ b/docs/retrieve.md
@@ -0,0 +1,561 @@
+# Retrieve Pipeline
+
+The retrieve pipeline is responsible for fetching raw cinema listing data from
+websites and APIs. It produces the unprocessed data that the
+[transform pipeline](./transform.md) later converts into a standardised format.
+
+## Architecture
+
+The retrieve step supports two kinds of modules:
+
+- **Cinemas** -- individual venues (e.g. `odeon.co.uk-leicester-square`). Each has
+  its own `retrieve.js`, `transform.js`, and `attributes.js`.
+- **Sources** -- external ticketing platforms (e.g. Eventbrite, Dice.fm). These
+  supply supplementary event data that cinemas can incorporate during transform.
+
+### Dispatch Flow
+
+```mermaid
+flowchart TD
+    A[scripts/retrieve/index.js] --> B{Cinema or Source?}
+    B -->|Cinema| C[cinemas/index.js]
+    B -->|Source| D[sources/index.js]
+    C --> E["Load cinema module"]
+    D --> F["Load source module"]
+    E --> G["Call retrieve()"]
+    F --> G
+    G --> H[Return Raw Data]
+    H --> I[Transform Pipeline]
+```
+
+The entry point (`scripts/retrieve/index.js`) resolves the location name to either
+a cinema or source module, then calls its `retrieve()` function. The returned data
+is opaque to the orchestrator -- each module defines its own structure.
+
+### The Delegation Pattern
+
+Most cinemas don't implement retrieval from scratch.
Instead, they delegate to a
+shared platform module in `common/`, passing venue-specific attributes:
+
+```
+cinemas/odeon.co.uk-leicester-square/retrieve.js
+  → common/odeon.co.uk/retrieve.js
+    → common/ocapi-v1/retrieve.js
+```
+
+A typical cinema `retrieve.js` looks like this:
+
+```js
+// cinemas/odeon.co.uk-leicester-square/retrieve.js
+const attributes = require("./attributes");
+const odeonRetrieve = require("../../common/odeon.co.uk/retrieve");
+
+async function retrieve() {
+  return odeonRetrieve(attributes);
+}
+```
+
+The `attributes.js` file provides venue-specific configuration:
+
+```js
+// cinemas/odeon.co.uk-leicester-square/attributes.js
+module.exports = {
+  id: "odeon.co.uk-leicester-square",
+  name: "ODEON Luxe Leicester Square",
+  domain: "https://www.odeon.co.uk",
+  url: "https://www.odeon.co.uk/cinemas/london-leicester-square",
+  cinemaId: "153",
+  // ... address, geo, socials, etc.
+};
+```
+
+Of the 145 cinemas with retrieve implementations, 112 delegate to one of 17 shared
+platforms in `common/`. The remaining ~33 have standalone implementations.
+
+---
+
+## Retrieval Approaches
+
+### Direct JSON Fetch
+
+The simplest approach: a single `fetchJson()` call returns all needed data.
+
+**Platforms:** Electric Cinema (2 venues)
+
+```js
+// common/electriccinema.co.uk/retrieve.js
+async function retrieve({ domain }) {
+  const site = await fetchJson(`${domain}/data/data.json`);
+  return site;
+}
+```
+
+### Single HTML Page
+
+Fetches a single HTML page that contains all listing data. No detail page requests
+are needed because the listing page has enough information for the transform step.
+
+**Platforms:** Firmdale Hotels (3 venues)
+**Sources:** Stow Film Lounge
+
+```js
+// common/firmdalehotels.com/retrieve.js
+async function retrieve({ url }) {
+  const movieListPage = await fetchText(url);
+  return { movieListPage };
+}
+```
+
+### HTML Scraping (List + Detail Pages)
+
+The most common pattern for standalone cinemas.
Fetches a listing page, parses it
+with Cheerio to extract links, then fetches each detail page individually.
+
+**Platforms:** Tate (2 venues), Olympic Studios (3 venues), The Castle Cinema (2
+venues), Admit One (2 venues)
+**Standalone:** ~33 cinemas use this pattern with venue-specific selectors
+**Sources:** OutSavvy, Wimbledon Film Club
+
+```js
+// cinemas/ica.art/retrieve.js
+async function retrieve() {
+  const movieListPage = await fetchText(url);
+  const $ = cheerio.load(movieListPage);
+
+  const moviePageUrls = new Set();
+  $(".item.films").each(function () {
+    const url = $(this).children("a").attr("href");
+    moviePageUrls.add(`${domain}${url}`);
+  });
+
+  const moviePages = {};
+  for (const moviePageUrl of [...moviePageUrls]) {
+    moviePages[moviePageUrl] = await fetchText(moviePageUrl);
+  }
+
+  return { movieListPage, moviePages };
+}
+```
+
+Each venue uses different CSS selectors (`.item.films`, `.card-list .card a`,
+`.programme-tile`, `.whatson_panel`, etc.) but the fetch-parse-fetch structure is
+the same.
+
+**Variant:** Admit One uses `fetchWin1252Text()` instead of `fetchText()` to handle
+legacy Windows-1252 encoded pages.
+
+### Embedded JSON Extraction
+
+Fetches an HTML page and extracts structured data from embedded `