Plan generated by LLM -- a good starting point, though perhaps a little overengineered. It's done a good job of categorising the different corrections required, though an initial pass may be to get it to see if there are any common patterns in the known removable phrases.
Refactor Plan: normalize-title.js into Categorized Rules
Overview
The current normalize-title.js contains ~527 correction rules in a single
array, mixing different types of transformations. This plan outlines how to
refactor it into categorized, maintainable rule sets.
Current State Analysis
The corrections array (lines 21-548) contains rules that fall into these
distinct categories:
1. Spelling Corrections (~40 rules)
Typos in cinema listings that need fixing to match TMDB:
["Wildnerness", "Wilderness"],
["Carvaggio", "Caravaggio"],
["Labryinth", "Labyrinth"],
["Downtown Abbey", "Downton Abbey"],
["Prime Minster", "Prime Minister"],
2. Separator Standardization (~80 rules)
Converting - to : or standardizing formatting:
["Average Rob -", "Average Rob:"],
["CBeebies - ", "CBeebies: "],
["Film Club -", "Film Club: "],
["Migrant Cinema - ", "Migrant Cinema: "],
3. Plus/Ampersand Normalization (~30 rules)
Standardizing + to & (or, in some rules, "and") for double bills:
[" + Cat", " and Cat"],
["Tiddler + ", "Tiddler & "],
["friends + crew", "friends & crew"],
4. Title Expansions (~25 rules)
Short titles that need expanding to match TMDB's full title:
[/^Mishima$/i, "Mishima: A Life in Four Chapters"],
[/^Dr\.? Strangelove$/i, "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"],
["Mulholland Dr.", "Mulholland Drive"],
5. Franchise Standardization (~40 rules)
Consistent naming for film series:
["Mission: Impossible - ", "Mission: Impossible – "],
["Mission: Impossible 2", "Mission: Impossible II"],
[/Lord of the Rings -/i, "Lord of the Rings: "],
["Frozen 2", "Frozen II"],
6. Indian Cinema Transliteration (~20 rules)
Variant spellings of Indian film titles:
["Vasthunnam", "Vasthunam"],
["Sardaar Ji", "Sardaarji"],
["Bāhubali", "Baahubali"],
["Shanthamee Reethriyil", "Shanthamee Raathriyil"],
7. Mystery/Secret Screenings (~8 rules)
Collapsing various "mystery movie" naming patterns:
[/(free |monthly )?mystery ([\w+]+ )?([\w+]+ )?(night|film|movie|cinema|screening):?/i, "mystery movie"],
[/(classic )?secret scre(e|a)(n|m)ing( \d+)?/i, "mystery movie"],
8. Cinema-Specific Quirks (~50 rules)
Corrections for how specific cinemas name things:
["Battleground + intro ", "Battlefield + intro "], // BFI gets the name wrong
["ODEON Pride Nights - ", "ODEON Pride Nights "],
9. Suffix Removal (~30 rules)
Removing trailing noise:
[/\s+dub?$/i, ""], // Dubbed
[/\s+sub?$/i, ""], // Subbed
[/\s+(3|2)d$/i, ""], // 3D/2D
[/\s+extended$/i, ""],
10. Year-Bound Corrections (~200+ rules)
Specific corrections that only make sense for a limited time:
["CBeebies Panto 2025", "CBeebies Panto"],
["Disney Junior Cinema Club 2025", "Disney Junior Cinema Club"],
[" + 28YL: The Bone Temple", " "], // Specific to 2025 releases
Proposed Directory Structure
common/
  normalize-title/
    index.js                            # Main entry point
    rules/
      index.js                          # Combines all rules
      spelling-corrections.js           # Typos
      separator-standardization.js      # Dash to colon conversions
      plus-ampersand-normalization.js
      title-expansions.js               # Short → full titles
      franchise-standardization.js      # Series naming consistency
      indian-cinema.js                  # Transliteration variants
      mystery-screenings.js             # Secret/mystery movie patterns
      suffix-removal.js                 # Trailing noise removal
      cinema-specific.json              # Data file, not code
      time-bound-corrections.json       # With expiry dates
Implementation Details
JSON Data Files for Static Rules
For cinema-specific quirks that are purely data:
// cinema-specific.json
{
  "rules": [
    {
      "pattern": "Battleground + intro ",
      "replacement": "Battlefield + intro ",
      "source": "bfi.org.uk",
      "note": "BFI incorrectly names this film"
    },
    {
      "pattern": "ODEON Pride Nights - ",
      "replacement": "ODEON Pride Nights ",
      "source": "odeon.co.uk"
    }
  ]
}
Time-Bound Rules with Expiry Dates
// time-bound-corrections.json
{
  "rules": [
    {
      "pattern": "CBeebies Panto 2025",
      "replacement": "CBeebies Panto",
      "expires": "2026-03-01",
      "note": "Remove after panto season ends"
    },
    {
      "pattern": " + 28YL: The Bone Temple",
      "replacement": " ",
      "expires": "2025-12-31",
      "note": "Double bill promo for 28 Years Later"
    }
  ]
}
Category-Specific Rule Files
// rules/spelling-corrections.js
module.exports = [
  // General English typos
  ["Wildnerness", "Wilderness"],
  ["Labryinth", "Labyrinth"],
  // British to American spelling variants (not typos)
  ["behaviour", "behavior"],
  ["colourful", "colorful"],
  // Film-specific misspellings
  ["Carvaggio", "Caravaggio"],
  ["Downtown Abbey", "Downton Abbey"],
  ["The God Father", "The Godfather"],
];
// rules/franchise-standardization.js
module.exports = [
  // Mission: Impossible
  ["Mission: Impossible - ", "Mission: Impossible – "],
  ["Mission: Impossible 2", "Mission: Impossible II"],
  [/M:I Season: (?!Mission)/i, "M:I Season: Mission: Impossible – "],
  // Lord of the Rings
  [/(?:\s|^)LOTR(?:\s|:)(?:\s*the\s+)?/i, "The Lord of the Rings: The "],
  [/Lord of the Rings -/i, "Lord of the Rings: "],
  // Frozen
  ["Frozen 2", "Frozen II"],
];
Rule Loader with Expiry Filtering
// rules/index.js
const spellingCorrections = require("./spelling-corrections");
const separatorStandardization = require("./separator-standardization");
const franchiseStandardization = require("./franchise-standardization");
// ... etc
// Load JSON data files (require parses JSON directly, so fs and path are not needed)
const cinemaSpecific = require("./cinema-specific.json");
const timeBound = require("./time-bound-corrections.json");
// Filter out expired time-bound rules
const activeTimeBound = timeBound.rules
  .filter((rule) => new Date(rule.expires) > new Date())
  .map((rule) => [rule.pattern, rule.replacement]);
// Convert JSON rules to array format
const cinemaRules = cinemaSpecific.rules.map((rule) => [
  rule.pattern,
  rule.replacement,
]);
// Rules are applied in the order listed here, so the sequence affects the output
module.exports = [
  ...spellingCorrections,
  ...separatorStandardization,
  ...franchiseStandardization,
  ...cinemaRules,
  ...activeTimeBound,
  // ... etc
];
Simplified Main Function
// normalize-title/index.js
const removeDiacritics = require("diacritics").remove;
const rules = require("./rules");
const prefixStrippers = require("./prefix-strippers");
const characterNormalization = require("./character-normalization");
function normalizeTitle(title, options) {
  title = title.replace(/\s+/g, " ").toLowerCase();
  // Apply categorized rules
  for (const [pattern, replacement] of rules) {
    const regex = typeof pattern === "string" ? pattern.toLowerCase() : pattern;
    title = title.replace(regex, replacement.toLowerCase());
  }
  // Apply prefix stripping (the "presents:", "screening:" logic)
  title = prefixStrippers.apply(title);
  // Apply character normalization (diacritics, punctuation, etc.)
  title = characterNormalization.apply(title);
  return title.trim();
}
module.exports = normalizeTitle;
Benefits
| Benefit | Description |
|---|---|
| Maintainability | Each category is in its own file, easier to find and update |
| Self-Documenting | JSON files can include source and note fields explaining why rules exist |
| Auto-Expiry | Time-bound rules automatically stop applying after their expiry date |
| Testability | Each category can have its own unit tests |
| Discoverability | When a cinema reports an issue, its rules can be found via the source field in cinema-specific.json |
| Metrics | You can count rules per category to see where complexity lies |
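The Metrics row could be backed by a few lines of code. The following is a minimal sketch, assuming each category module exports a plain array of [pattern, replacement] pairs as in the examples above. `countRulesByCategory` and the sample `categories` object are hypothetical names used only for illustration, not part of the existing codebase.

```javascript
// Hypothetical helper: takes an object mapping category name -> rule array
// and returns [name, count] pairs sorted from largest to smallest category.
function countRulesByCategory(categories) {
  return Object.entries(categories)
    .map(([name, rules]) => [name, rules.length])
    .sort((a, b) => b[1] - a[1]);
}

// Sample data standing in for the real rule modules.
const categories = {
  "spelling-corrections": [
    ["Wildnerness", "Wilderness"],
    ["Labryinth", "Labyrinth"],
  ],
  "suffix-removal": [[/\s+extended$/i, ""]],
  "franchise-standardization": [
    ["Frozen 2", "Frozen II"],
    ["Mission: Impossible 2", "Mission: Impossible II"],
    [/Lord of the Rings -/i, "Lord of the Rings: "],
  ],
};

const counts = countRulesByCategory(categories);
for (const [name, count] of counts) {
  console.log(`${name}: ${count}`);
}
```

In the real loader the `categories` object would be built from the required modules, which is one reason to keep each category as a separate export before spreading them into a single array.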
Migration Strategy
Phase 1: Infrastructure
- Create the normalize-title/ directory structure
- Create the rule loader (rules/index.js)
- Update normalize-title.js to use the new structure (still with all rules inline)
- Verify all tests pass
Phase 2: Extract Clean Categories (one at a time)
- Spelling Corrections - Pure data, easiest to extract
- Suffix Removal - Well-defined patterns
- Mystery Screenings - Small, self-contained
- Franchise Standardization - Grouped by franchise
For each extraction:
- Move rules to new file
- Add category-specific unit tests
- Verify main tests still pass
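Because rules are applied sequentially, regrouping them by category can change the output whenever two rules touch the same text. One way to catch this during each extraction is a parity check between the original rule order and the regrouped one. The sketch below assumes the rule-application loop from the proposed normalizeTitle. `applyRules`, `findMismatches`, and the second sample rule are hypothetical and exist only to show an order dependency.

```javascript
// Applies [pattern, replacement] rules in order, mirroring the loop in the
// proposed normalizeTitle (string patterns replace the first occurrence only).
function applyRules(title, rules) {
  let result = title.toLowerCase();
  for (const [pattern, replacement] of rules) {
    const matcher = typeof pattern === "string" ? pattern.toLowerCase() : pattern;
    result = result.replace(matcher, replacement.toLowerCase());
  }
  return result;
}

// Hypothetical helper: returns the titles whose output differs between the
// original rule order and the regrouped one.
function findMismatches(titles, legacyRules, refactoredRules) {
  return titles
    .map((title) => ({
      title,
      before: applyRules(title, legacyRules),
      after: applyRules(title, refactoredRules),
    }))
    .filter(({ before, after }) => before !== after);
}

// Two rules whose relative order matters (the second rule is made up).
const legacyRules = [
  ["Frozen 2", "Frozen II"],
  ["Frozen II - ", "Frozen II: "],
];
// Same rules, but reordered as might happen when splitting into files.
const refactoredRules = [legacyRules[1], legacyRules[0]];

const mismatches = findMismatches(
  ["Frozen 2 - Sing-Along", "Wilderness"],
  legacyRules,
  refactoredRules,
);
console.log(mismatches);
```

Running such a check over the titles in test-titles.json after each extraction would show whether a category can move safely or whether its rules depend on others running first.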
Phase 3: Extract Complex Categories
- Separator Standardization - Large but straightforward
- Plus/Ampersand Normalization - Related to separators
- Title Expansions - Need careful testing
- Indian Cinema - May need transliteration expertise
Phase 4: Data-Driven Categories
- Cinema-Specific Quirks - Convert to JSON with metadata
- Time-Bound Corrections - Add expiry date infrastructure
- Add CI check for expired rules (optional cleanup reminder)
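The optional CI check could look something like the sketch below. It assumes the rule shape shown in time-bound-corrections.json. `findExpiredRules` is a hypothetical helper, and the `now` parameter is an addition made here so the check can be tested with a fixed date.

```javascript
// Hypothetical CI helper: returns the rules whose expiry date has passed
// (or cannot be parsed), so a pipeline can flag them for cleanup.
function findExpiredRules(rules, now = new Date()) {
  return rules.filter((rule) => {
    const expires = new Date(rule.expires);
    // An unparseable date is reported too, since it would never be cleaned up.
    return Number.isNaN(expires.getTime()) || expires <= now;
  });
}

// Sample rules in the same shape as time-bound-corrections.json.
const sampleRules = [
  { pattern: "CBeebies Panto 2025", replacement: "CBeebies Panto", expires: "2026-03-01" },
  { pattern: " + 28YL: The Bone Temple", replacement: " ", expires: "2025-12-31" },
];

// A fixed "now" keeps the example deterministic; a CI script would omit it.
const expired = findExpiredRules(sampleRules, new Date("2026-01-15"));
for (const rule of expired) {
  console.warn(`Expired rule: "${rule.pattern}" (expired ${rule.expires})`);
}
// In CI, a non-zero exit code would fail the job:
// process.exitCode = expired.length > 0 ? 1 : 0;
```

Whether the check should fail the build or only warn is a policy choice. Since the loader already skips expired rules, a warning may be enough to prompt cleanup.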
Phase 5: Refactor Prefix Stripping
The hasPresents, hasScreenings, etc. logic (lines 570-675) should be
extracted to its own module with a cleaner pattern-based approach.
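A pattern-based version might reduce to a list of regular expressions applied in order. The sketch below is only an illustration of the shape such a module could take. The existing hasPresents / hasScreenings logic is not shown in this plan, so the two patterns and the `stripPrefixes` name are assumptions and not a port of the current behaviour.

```javascript
// Hypothetical prefix patterns; the real list would be derived from the
// existing hasPresents / hasScreenings logic.
const prefixPatterns = [
  /^.*?\bpresents?:\s*/i, // "Film Club presents: Alien" -> "Alien"
  /^.*?\bscreening:\s*/i, // "Members' screening: Alien" -> "Alien"
];

// Applies each pattern once, in order, removing whatever it matches
// at the start of the title.
function stripPrefixes(title, patterns = prefixPatterns) {
  return patterns.reduce(
    (current, pattern) => current.replace(pattern, ""),
    title,
  );
}

console.log(stripPrefixes("Film Club presents: Alien"));
console.log(stripPrefixes("Members' screening: Alien"));
console.log(stripPrefixes("Alien"));
```

To match the call in the simplified main function, the module would export this function as `apply`. Patterns this broad would also strip a genuine title that happens to contain "presents:", so the existing regression tests remain the safety net for this phase.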
Testing Strategy
Preserve Existing Tests
The existing normalize-title.test.js with test-titles.json should pass
throughout the refactoring. These are regression tests.
Add Category-Specific Tests
// rules/spelling-corrections.test.js
const spellingCorrections = require("./spelling-corrections");
const normalizeTitle = require("../index");
describe("Spelling Corrections", () => {
  test.each(spellingCorrections)('corrects "%s" to "%s"', (input, expected) => {
    const pattern = typeof input === "string" ? input : input.source;
    expect(normalizeTitle(pattern)).toContain(expected.toLowerCase());
  });
});
Add Expiry Tests
// rules/time-bound-corrections.test.js
const timeBound = require("./time-bound-corrections.json");
describe("Time-Bound Corrections", () => {
  it("all rules have valid expiry dates", () => {
    for (const rule of timeBound.rules) {
      // new Date() always returns a Date object, so check the parsed value instead
      expect(new Date(rule.expires).toString()).not.toBe("Invalid Date");
    }
  });
  it("warns about rules expiring soon", () => {
    const thirtyDaysFromNow = new Date();
    thirtyDaysFromNow.setDate(thirtyDaysFromNow.getDate() + 30);
    // This also includes rules that have already expired
    const expiringSoon = timeBound.rules.filter(
      (rule) => new Date(rule.expires) < thirtyDaysFromNow,
    );
    if (expiringSoon.length > 0) {
      console.warn("Rules expired or expiring within 30 days:", expiringSoon);
    }
  });
});
Estimated Effort
| Phase | Effort | Risk |
|---|---|---|
| Phase 1: Infrastructure | Low | Low |
| Phase 2: Clean Categories | Medium | Low |
| Phase 3: Complex Categories | Medium | Medium |
| Phase 4: Data-Driven | Medium | Low |
| Phase 5: Prefix Stripping | Medium | Medium |
Total estimated effort: Medium
Recommended approach: Do one phase per sprint/cycle, validating thoroughly
before moving to the next.