Common scripts for managing cinema data
- Install dependencies: `npm install`
- Copy `.env.example` to `.env` and configure the following environment variables:

| Variable | Description |
|---|---|
| `MOVIEDB_API_KEY` | API key from The Movie Database |
| `GEMINI_API_KEY` | API key from Google AI Studio |
| `PAT` | GitHub personal access token for accessing release data |
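A quick check that the variables listed above are set can catch configuration problems before any script runs. The following is a minimal sketch, not part of this repository: `findMissingEnvVars` is a hypothetical helper, and it assumes the values have already been loaded into the process environment.

```javascript
// Variable names are taken from the list above; the helper itself is hypothetical.
const REQUIRED_ENV_VARS = ["MOVIEDB_API_KEY", "GEMINI_API_KEY", "PAT"];

// Return the names of required variables that are unset or blank in `env`.
function findMissingEnvVars(env = process.env) {
  return REQUIRED_ENV_VARS.filter((name) => !(env[name] || "").trim());
}

// Report anything still missing in the current environment.
const missing = findMissingEnvVars();
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(", ")}`);
}
```

How the scripts themselves load `.env` is not described here. When running this sketch on its own, one option is Node's `--env-file=.env` flag, which is available in recent Node releases.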
The typical data processing workflow is:
retrieve → transform → combine → match
               ↓
             cache (optional, speeds up combine)
- Retrieve - Scrape raw data from cinema websites and external sources
- Transform - Normalize data and match movies against TMDB
- Combine - Merge all cinemas into a unified dataset with enriched metadata
- Match - Add ratings and links from external review sites
- Cache (optional) - Pre-cache TMDB data to speed up future combine runs
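The ordering of the steps above can be sketched as a list of commands. This is an illustrative helper only: `buildPipelineCommands` and its option names are hypothetical, it uses the `npm run` commands documented in this README, and it only builds the command strings without executing anything.

```javascript
// Build the npm commands for one pass through the workflow, in order.
// All names here are hypothetical; the commands are the ones this README documents.
function buildPipelineCommands({ retrieveTargets, cinemas, matchSources, useCache = false }) {
  const commands = [];
  // Retrieve every cinema and source first, since transform depends on retrieved data.
  for (const target of retrieveTargets) commands.push(`npm run retrieve ${target}`);
  // Transform is run per cinema.
  for (const cinema of cinemas) commands.push(`npm run transform ${cinema}`);
  // Caching is optional and only useful before combine.
  if (useCache) commands.push("npm run cache");
  commands.push("npm run combine");
  // Match runs once per review source, after combine.
  for (const source of matchSources) commands.push(`npm run match ${source}`);
  return commands;
}

console.log(
  buildPipelineCommands({
    retrieveTargets: ["princecharlescinema.com", "eventbrite.co.uk"],
    cinemas: ["princecharlescinema.com"],
    matchSources: ["imdb"],
    useCache: true,
  }).join("\n")
);
```

Keeping the helper free of side effects makes the order easy to inspect before handing the commands to a shell or a CI job.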
This script retrieves data from supported cinemas and sources, and saves it as a single JSON file.
To run this script:
# Internally
npm run retrieve <cinema|source>
# Externally
npx clusterflick/scripts retrieve <cinema|source>
Where <cinema|source> can be replaced with any cinema under cinemas/ (e.g.
princecharlescinema.com) or any source under sources/ (e.g. eventbrite.co.uk).
Once complete, data will be saved as a JSON blob in the retrieved-data/
directory in a file named the same as the cinema or source used.
Retrieving information from the Prince Charles Cinema
> $ npm run retrieve princecharlescinema.com
> scripts@1.0.0 retrieve
> TZ=Europe/London node index.js retrieve princecharlescinema.com
[🎞️ Location: princecharlescinema.com]
Retrieving data ...
- ✅ Retrieved (1s)
> $ ls ./retrieved-data
princecharlescinema.com
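To inspect the saved output from Node, the file can be read and parsed directly. This is a minimal sketch: `loadStageData` is a hypothetical helper, and it assumes (based on the listing above) that the file has no extension and is named exactly after the cinema or source. The shape of the parsed JSON is not assumed.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Read and parse the JSON saved for `name` inside a data directory such as
// "retrieved-data". `baseDir` defaults to the current working directory.
function loadStageData(directory, name, baseDir = process.cwd()) {
  const filePath = path.join(baseDir, directory, name);
  return JSON.parse(fs.readFileSync(filePath, "utf8"));
}

// Usage (assumes `npm run retrieve princecharlescinema.com` has already run):
// const data = loadStageData("retrieved-data", "princecharlescinema.com");
// console.log(Array.isArray(data) ? `${data.length} entries` : Object.keys(data));
```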
This script transforms retrieved data from supported cinemas, and saves it as a single JSON file. See the Transform Pipeline documentation for detailed information on how the transform process works.
ℹ️ Note: Before running this script, please make sure you have:
- Set up a `.env` file containing your Movie DB API key (`MOVIEDB_API_KEY`) and Gemini API key (`GEMINI_API_KEY`)
- Retrieved the necessary cinema and source data using the `retrieve` script (above)
To run this script:
# Internally
npm run transform <cinema>
# Externally
npx clusterflick/scripts transform <cinema>
Where <cinema> can be replaced with any cinema under cinemas/ (e.g.
princecharlescinema.com).
Once complete, data will be saved as a JSON blob in the transformed-data/
directory in a file named the same as the cinema used.
The data output will conform to the JSON schema defined in ./schema.json
Transforming information from the Prince Charles Cinema
> $ npm run transform princecharlescinema.com
> scripts@1.0.0 transform
> TZ=Europe/London node index.js transform princecharlescinema.com
[🎞️ Location: princecharlescinema.com]
Transforming data ...
- ✅ Transformed (0s)
Matching data ...
- ✅ Matched (218/227 in 1s)
Checking historical data ...
- Found 3 new movies
- ✅ Done (2s)
Categorising data ...
- ✅ Categorised (1s)
Processing multiple-movies events ...
- ✅ Processed 2 multi-movie events (0s)
Validating data ...
- ✅ Validated (0s)
> $ ls ./transformed-data
princecharlescinema.com
This script combines transformed data from all cinemas into a single unified dataset. It enriches movies with additional metadata from TMDB (classification, cast, crew, genres, trailers) and merges duplicate movies that appear across multiple venues.
ℹ️ Note: Before running this script, please make sure you have:
- Set up a `.env` file containing your GitHub personal access token (`PAT`)
- Transformed data for all cinemas using the `transform` script
To run this script:
# Internally
npm run combine
# Externally
npx clusterflick/scripts combine
Once complete, data will be saved in combined-data/combined-data.json.
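Since combine expects transformed data for every cinema, it can help to check for gaps first. The sketch below is a hypothetical helper, not part of the scripts: it assumes that entries under cinemas/ are named exactly like the files in transformed-data/, which may not hold if the cinema entries carry a file extension.

```javascript
const fs = require("node:fs");

// Names present in `expected` but absent from `actual`.
function findMissing(expected, actual) {
  const present = new Set(actual);
  return expected.filter((name) => !present.has(name));
}

// List a directory's entries, treating a missing directory as empty.
function listNames(directory) {
  return fs.existsSync(directory) ? fs.readdirSync(directory) : [];
}

// Compare the known cinemas with what has been transformed so far.
const untransformed = findMissing(listNames("cinemas"), listNames("transformed-data"));
if (untransformed.length > 0) {
  console.warn(`Not yet transformed: ${untransformed.join(", ")}`);
}
```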
This script matches movies from the combined data against external review sources to retrieve ratings and review URLs.
ℹ️ Note: Before running this script, please make sure you have:
- Combined data using the `combine` script
To run this script:
# Internally
npm run match <source>
# Externally
npx clusterflick/scripts match <source>
Where <source> can be one of:
- `rottentomatoes` - Match against Rotten Tomatoes
- `metacritic` - Match against Metacritic
- `letterboxd` - Match against Letterboxd
- `imdb` - Match against IMDb
Once complete, data will be saved in the matched-data/ directory.
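Running match for every source means one command per source name. The following sketch builds those commands and rejects unknown names early. `matchCommand` is a hypothetical helper, and the script's own behaviour for an unsupported source is not documented here.

```javascript
// Source names as listed above.
const MATCH_SOURCES = ["rottentomatoes", "metacritic", "letterboxd", "imdb"];

// Build the `npm run match` command for a source, throwing on unknown names.
function matchCommand(source) {
  if (!MATCH_SOURCES.includes(source)) {
    throw new Error(
      `Unknown source "${source}". Expected one of: ${MATCH_SOURCES.join(", ")}`
    );
  }
  return `npm run match ${source}`;
}

// Print one command per supported source.
console.log(MATCH_SOURCES.map(matchCommand).join("\n"));
```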
This script pre-caches TMDB movie data for all transformed movies. This speeds up subsequent combine runs by avoiding repeated API calls.
To run this script:
# Internally
npm run cache
# Externally
npx clusterflick/scripts cache
Once complete, cached data will be saved in cached-data/moviedb-data.json.
These scripts help manage local data directories:
| Script | Description |
|---|---|
| `npm run clear:cache` | Remove cached API responses |
| `npm run clear:retrieved-data` | Remove all retrieved data |
| `npm run clear:transformed-data` | Remove all transformed data |
| `npm run clear:combined-data` | Remove combined data |
| `npm run clear:matched-data` | Remove matched data |
| `npm run clear:all` | Remove all of the above |
Scripts in the helpers/ directory provide additional functionality for
development and debugging.
These scripts download data from the clusterflick GitHub repositories, useful for local development without running the full pipeline:
| Script | Description |
|---|---|
| `./helpers/get-latest-retrieved-data.sh` | Download latest retrieved data from all cinemas |
| `./helpers/get-latest-transformed-data.sh` | Download latest transformed data from all cinemas |
| `./helpers/get-latest-combined-data.sh` | Download latest combined dataset |
| `./helpers/get-last-10-days-combined-data.sh [dir] [days]` | Download combined data from the last N days (default: 10) |
Requirements: curl, wget, and jq (for the 10-days script)
Manually test the TMDB matching logic for a specific movie title:
node helpers/run-matcher.js "<title>" [year] [directors] [actors] [matchingHints]
Examples:
# Basic title search
node helpers/run-matcher.js "The Godfather"
# With year
node helpers/run-matcher.js "The Godfather" 1972
# With director
node helpers/run-matcher.js "The Godfather" 1972 "Francis Ford Coppola"
# With multiple actors (comma-separated)
node helpers/run-matcher.js "The Godfather" 1972 "" "Marlon Brando,Al Pacino"

List all movies from transformed data that failed to match against TMDB, grouped by title. Useful for identifying matching issues:
node helpers/highlight-hydration-misses-for-review.js
Output includes for each unmatched movie:
- Category (movie, event, multiple-movies, etc.)
- Normalized title and year
- TMDB search link
- Source URL
- Venues where it appears
Also displays a summary of unmatched entries grouped by category.