
Create structured approach to title normalization #209

@alistairjcbrown


Plan generated by an LLM: a good starting point, though perhaps a little overengineered. It has done a good job of categorising the different corrections required, though a useful first pass may be to have it check whether there are any common patterns in the known removable phrases.
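As a rough illustration of that first pass, the sketch below groups removable phrases by their final word so that clusters which could collapse into one regex stand out. The sample phrases and the countByLastWord name are hypothetical; the real phrases live in normalize-title.js and are not reproduced here.

```javascript
// Hypothetical sample of removable phrases (illustration only).
const removablePhrases = [
  " + intro",
  " + Q&A",
  " + director Q&A",
  " + recorded intro",
  " (subtitled)",
];

// Count how often each final word appears across the phrases.
function countByLastWord(phrases) {
  const counts = new Map();
  for (const phrase of phrases) {
    const words = phrase.trim().toLowerCase().split(/\s+/);
    const last = words[words.length - 1];
    counts.set(last, (counts.get(last) || 0) + 1);
  }
  // Most frequent first; ties keep their first-seen order.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countByLastWord(removablePhrases));
```

Phrases sharing a final word (here "intro" and "q&a") are candidates for a single pattern rather than several literal rules.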


Refactor Plan: normalize-title.js into Categorized Rules

Overview

The current normalize-title.js contains ~527 correction rules in a single
array, mixing different types of transformations. This plan outlines how to
refactor it into categorized, maintainable rule sets.


Current State Analysis

The corrections array (lines 21-548) contains rules that fall into these
distinct categories:

1. Spelling Corrections (~40 rules)

Typos in cinema listings that need fixing to match TMDB:

["Wildnerness", "Wilderness"],
["Carvaggio", "Caravaggio"],
["Labryinth", "Labyrinth"],
["Downtown Abbey", "Downton Abbey"],
["Prime Minster", "Prime Minister"],

2. Separator Standardization (~80 rules)

Converting - to : or standardizing formatting:

["Average Rob -", "Average Rob:"],
["CBeebies - ", "CBeebies: "],
["Film Club -", "Film Club: "],
["Migrant Cinema - ", "Migrant Cinema: "],

3. Plus/Ampersand Normalization (~30 rules)

Standardizing + to & (or "and") for double bills:

[" + Cat", " and Cat"],
["Tiddler + ", "Tiddler & "],
["friends + crew", "friends & crew"],

4. Title Expansions (~25 rules)

Short titles that need expanding to match TMDB's full title:

[/^Mishima$/i, "Mishima: A Life in Four Chapters"],
[/^Dr\.? Strangelove$/i, "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"],
["Mulholland Dr.", "Mulholland Drive"],

5. Franchise Standardization (~40 rules)

Consistent naming for film series:

["Mission: Impossible - ", "Mission: Impossible – "],
["Mission: Impossible 2", "Mission: Impossible II"],
[/Lord of the Rings -/i, "Lord of the Rings: "],
["Frozen 2", "Frozen II"],

6. Indian Cinema Transliteration (~20 rules)

Variant spellings of Indian film titles:

["Vasthunnam", "Vasthunam"],
["Sardaar Ji", "Sardaarji"],
["Bāhubali", "Baahubali"],
["Shanthamee Reethriyil", "Shanthamee Raathriyil"],

7. Mystery/Secret Screenings (~8 rules)

Collapsing various "mystery movie" naming patterns:

[/(free |monthly )?mystery ([\w+]+ )?([\w+]+ )?(night|film|movie|cinema|screening):?/i, "mystery movie"],
[/(classic )?secret scre(e|a)(n|m)ing( \d+)?/i, "mystery movie"],
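To make the two patterns above easier to review, here is a small sketch that applies them to a lowercased title. The collapseMystery helper and the sample titles are illustrative only and are not part of normalize-title.js.

```javascript
// The two mystery/secret screening rules quoted above.
const mysteryRules = [
  [/(free |monthly )?mystery ([\w+]+ )?([\w+]+ )?(night|film|movie|cinema|screening):?/i, "mystery movie"],
  [/(classic )?secret scre(e|a)(n|m)ing( \d+)?/i, "mystery movie"],
];

// Hypothetical helper: lowercase the title, then apply each rule in order.
function collapseMystery(title) {
  return mysteryRules.reduce(
    (result, [pattern, replacement]) => result.replace(pattern, replacement),
    title.toLowerCase(),
  );
}

console.log(collapseMystery("Free Mystery Movie Night"));
console.log(collapseMystery("Classic Secret Screaming 3"));
console.log(collapseMystery("Monthly Mystery Horror Film: Scream"));
```

The second sample shows why the pattern accepts "screaming": the (e|a) and (n|m) alternations deliberately cover punning or misspelt listings.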

8. Cinema-Specific Quirks (~50 rules)

Corrections for how specific cinemas name things:

["Battleground + intro ", "Battlefield + intro "], // BFI gets the name wrong
["ODEON Pride Nights - ", "ODEON Pride Nights "],

9. Suffix Removal (~30 rules)

Removing trailing noise:

[/\s+dub?$/i, ""],  // Dubbed
[/\s+sub?$/i, ""],  // Subbed
[/\s+(3|2)d$/i, ""], // 3D/2D
[/\s+extended$/i, ""],
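Because each suffix rule is anchored to the end of the title, the result depends on rule order when a title carries more than one suffix. One possible approach, sketched here as an assumption rather than current behaviour, is to repeat the pass until nothing changes. stripSuffixes is a hypothetical name.

```javascript
// The suffix rules quoted above.
const suffixRules = [
  [/\s+dub?$/i, ""],
  [/\s+sub?$/i, ""],
  [/\s+(3|2)d$/i, ""],
  [/\s+extended$/i, ""],
];

// Hypothetical helper: re-run all suffix rules until the title is stable,
// so stacked suffixes are removed regardless of their order.
function stripSuffixes(title) {
  let previous;
  let current = title;
  do {
    previous = current;
    for (const [pattern, replacement] of suffixRules) {
      current = current.replace(pattern, replacement);
    }
  } while (current !== previous);
  return current;
}

console.log(stripSuffixes("Akira Dub 3D"));
```

A single pass over "Akira Dub 3D" would leave "Akira Dub", since the dub rule runs before the 3D suffix has been removed. Note also that `dub?` and `sub?` make the final "b" optional, so a trailing " du" or " su" is stripped as well.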

10. Year-Bound (Time-Bound) Corrections (200+ rules)

Specific corrections that only make sense for a limited time:

["CBeebies Panto 2025", "CBeebies Panto"],
["Disney Junior Cinema Club 2025", "Disney Junior Cinema Club"],
[" + 28YL: The Bone Temple", " "], // Specific to 2025 releases

Proposed Directory Structure

common/
  normalize-title/
    index.js                        # Main entry point
    rules/
      index.js                      # Combines all rules
      spelling-corrections.js       # Typos
      separator-standardization.js  # Dash to colon conversions
      plus-ampersand-normalization.js
      title-expansions.js           # Short → full titles
      franchise-standardization.js  # Series naming consistency
      indian-cinema.js              # Transliteration variants
      mystery-screenings.js         # Secret/mystery movie patterns
      suffix-removal.js             # Trailing noise removal
      cinema-specific.json          # Data file, not code
      time-bound-corrections.json   # With expiry dates

Implementation Details

JSON Data Files for Static Rules

For cinema-specific quirks that are purely data:

// cinema-specific.json
{
  "rules": [
    {
      "pattern": "Battleground + intro ",
      "replacement": "Battlefield + intro ",
      "source": "bfi.org.uk",
      "note": "BFI incorrectly names this film"
    },
    {
      "pattern": "ODEON Pride Nights - ",
      "replacement": "ODEON Pride Nights ",
      "source": "odeon.co.uk"
    }
  ]
}
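JSON cannot hold RegExp literals, so any cinema-specific rule that needs a regex would have to be stored as a string. A minimal sketch of one way to handle that follows; the "regex" and "flags" fields, the toRuleTuple name, and the second sample rule are all assumptions, not part of the plan above.

```javascript
// Hypothetical converter from a JSON rule object to a [pattern, replacement]
// tuple. A rule with "regex": true is compiled; others stay plain strings.
function toRuleTuple(rule) {
  const pattern = rule.regex
    ? new RegExp(rule.pattern, rule.flags || "i")
    : rule.pattern;
  return [pattern, rule.replacement];
}

const sample = {
  rules: [
    {
      pattern: "ODEON Pride Nights - ",
      replacement: "ODEON Pride Nights ",
      source: "odeon.co.uk",
    },
    {
      // Made-up rule, present only to show the regex path.
      pattern: "\\s+\\(preview\\)$",
      replacement: "",
      regex: true,
      source: "example.invalid",
    },
  ],
};

const tuples = sample.rules.map(toRuleTuple);
console.log(tuples);
```

Keeping the tuple shape means the existing replacement loop would not need to know whether a rule came from a .js file or a .json file.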

Time-Bound Rules with Expiry Dates

// time-bound-corrections.json
{
  "rules": [
    {
      "pattern": "CBeebies Panto 2025",
      "replacement": "CBeebies Panto",
      "expires": "2026-03-01",
      "note": "Remove after panto season ends"
    },
    {
      "pattern": " + 28YL: The Bone Temple",
      "replacement": " ",
      "expires": "2025-12-31",
      "note": "Double bill promo for 28 Years Later"
    }
  ]
}
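Since these rules are hand-edited data, a small shape check may help catch mistakes before they reach the loader. This is a sketch under the assumption that expiry dates are always written as YYYY-MM-DD; validateTimeBoundRule is a hypothetical name.

```javascript
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Hypothetical validator: returns a list of problems, empty when the rule
// has the fields the loader relies on.
function validateTimeBoundRule(rule) {
  const problems = [];
  if (typeof rule.pattern !== "string" || rule.pattern.length === 0) {
    problems.push("pattern must be a non-empty string");
  }
  if (typeof rule.replacement !== "string") {
    problems.push("replacement must be a string");
  }
  if (
    typeof rule.expires !== "string" ||
    !ISO_DATE.test(rule.expires) ||
    Number.isNaN(new Date(rule.expires).getTime())
  ) {
    problems.push("expires must be a valid YYYY-MM-DD date");
  }
  return problems;
}

console.log(
  validateTimeBoundRule({
    pattern: "CBeebies Panto 2025",
    replacement: "CBeebies Panto",
    expires: "2026-03-01",
  }),
);
```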

Category-Specific Rule Files

// rules/spelling-corrections.js
module.exports = [
  // General English typos
  ["Wildnerness", "Wilderness"],
  ["Labryinth", "Labyrinth"],
  ["behaviour", "behavior"],
  ["colourful", "colorful"],

  // Film-specific misspellings
  ["Carvaggio", "Caravaggio"],
  ["Downtown Abbey", "Downton Abbey"],
  ["The God Father", "The Godfather"],
];

// rules/franchise-standardization.js
module.exports = [
  // Mission: Impossible
  ["Mission: Impossible - ", "Mission: Impossible – "],
  ["Mission: Impossible 2", "Mission: Impossible II"],
  [/M:I Season: (?!Mission)/i, "M:I Season: Mission: Impossible – "],

  // Lord of the Rings
  [/(?:\s|^)LOTR(?:\s|:)(?:\s*the\s+)?/i, "The Lord of the Rings: The "],
  [/Lord of the Rings -/i, "Lord of the Rings: "],

  // Frozen
  ["Frozen 2", "Frozen II"],
];

Rule Loader with Expiry Filtering

// rules/index.js
const spellingCorrections = require("./spelling-corrections");
const separatorStandardization = require("./separator-standardization");
const franchiseStandardization = require("./franchise-standardization");
// ... etc

// Load JSON data files
const cinemaSpecific = require("./cinema-specific.json");
const timeBound = require("./time-bound-corrections.json");

// Filter out expired time-bound rules
const activeTimeBound = timeBound.rules
  .filter((rule) => new Date(rule.expires) > new Date())
  .map((rule) => [rule.pattern, rule.replacement]);

// Convert JSON rules to array format
const cinemaRules = cinemaSpecific.rules.map((rule) => [
  rule.pattern,
  rule.replacement,
]);

module.exports = [
  ...spellingCorrections,
  ...separatorStandardization,
  ...franchiseStandardization,
  ...cinemaRules,
  ...activeTimeBound,
  // ... etc
];
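The expiry filter above reads the clock directly, which makes it awkward to test, and a rule with a missing or malformed expires value is silently dropped because a comparison with an invalid Date is always false. A minimal sketch of an alternative, with an injectable clock and an explicit error, is shown below. The activeRules name is an assumption.

```javascript
// Hypothetical helper: keep rules that expire after "now" and convert them
// to [pattern, replacement] tuples. Throws instead of silently dropping a
// rule whose expiry date cannot be parsed.
function activeRules(rules, now = new Date()) {
  return rules
    .filter((rule) => {
      const expires = new Date(rule.expires);
      if (Number.isNaN(expires.getTime())) {
        throw new Error(`Invalid expiry date for pattern "${rule.pattern}"`);
      }
      return expires > now;
    })
    .map((rule) => [rule.pattern, rule.replacement]);
}

// The two sample rules from time-bound-corrections.json above.
const sampleTimeBound = [
  { pattern: "CBeebies Panto 2025", replacement: "CBeebies Panto", expires: "2026-03-01" },
  { pattern: " + 28YL: The Bone Temple", replacement: " ", expires: "2025-12-31" },
];

console.log(activeRules(sampleTimeBound, new Date("2026-01-15")));
```

With a fixed date of 2026-01-15, only the CBeebies rule is still active.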

Simplified Main Function

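// normalize-title/index.js
const rules = require("./rules");
// Assumed modules, not listed in the proposed directory structure: the
// prefix logic from Phase 5 and the existing diacritics/punctuation handling
// (the "diacritics" package would be required there rather than here).
const prefixStrippers = require("./prefix-strippers");
const characterNormalization = require("./character-normalization");

// `options` is kept for signature compatibility; it is unused in this sketch.
function normalizeTitle(title, options) {
  // Collapse runs of whitespace and lowercase before matching
  title = title.replace(/\s+/g, " ").toLowerCase();

  // Apply categorized rules in order. String patterns are lowercased to
  // match the lowercased title; regex patterns need the "i" flag.
  for (const [pattern, replacement] of rules) {
    const matcher = typeof pattern === "string" ? pattern.toLowerCase() : pattern;
    title = title.replace(matcher, replacement.toLowerCase());
  }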

  // Apply prefix stripping (the "presents:", "screening:" logic)
  title = prefixStrippers.apply(title);

  // Apply character normalization (diacritics, punctuation, etc.)
  title = characterNormalization.apply(title);

  return title.trim();
}

module.exports = normalizeTitle;
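To show how the rule loop behaves in isolation, here is a self-contained sketch with a handful of rules taken from the categories above. The applyRules name is hypothetical, and the prefix and character normalization steps are left out.

```javascript
// A few rules from the categories above, inlined for the example.
const rules = [
  ["Downtown Abbey", "Downton Abbey"],
  ["Frozen 2", "Frozen II"],
  [/\s+(3|2)d$/i, ""],
];

// Hypothetical standalone version of the rule loop.
function applyRules(title, ruleList) {
  let result = title.replace(/\s+/g, " ").toLowerCase();
  for (const [pattern, replacement] of ruleList) {
    // Lowercase string patterns so they can match the lowercased title.
    const matcher = typeof pattern === "string" ? pattern.toLowerCase() : pattern;
    result = result.replace(matcher, replacement.toLowerCase());
  }
  return result.trim();
}

console.log(applyRules("Downtown  Abbey 3D", rules)); // "downton abbey"
console.log(applyRules("Frozen 2", rules)); // "frozen ii"
```

Two properties of String.prototype.replace are worth keeping in mind: a string pattern replaces only the first occurrence, and a regex without the "i" flag that contains uppercase letters can never match, because the title has already been lowercased.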

Benefits

Benefit           Description
Maintainability   Each category is in its own file, so rules are easier to find and update
Self-Documenting  JSON files can include source and note fields explaining why rules exist
Auto-Expiry       Time-bound rules automatically stop applying after their expiry date
Testability       Each category can have its own unit tests
Discoverability   When a cinema's listings cause an issue, check the rules for that source in cinema-specific.json
Metrics           Counting rules per category shows where the complexity lies
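The Metrics benefit could be realised with a few lines once the categories are separate arrays. The sketch below uses inline stand-ins for the category modules; the countRules name and the object layout are assumptions.

```javascript
// Hypothetical stand-ins for the per-category rule modules.
const categories = {
  "spelling-corrections": [
    ["Wildnerness", "Wilderness"],
    ["Carvaggio", "Caravaggio"],
  ],
  "suffix-removal": [[/\s+extended$/i, ""]],
};

// Report the total and regex rule counts per category, largest first.
function countRules(categoryMap) {
  return Object.entries(categoryMap)
    .map(([name, rules]) => ({
      name,
      total: rules.length,
      regex: rules.filter(([pattern]) => pattern instanceof RegExp).length,
    }))
    .sort((a, b) => b.total - a.total);
}

console.log(countRules(categories));
```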

Migration Strategy

Phase 1: Infrastructure

  1. Create the normalize-title/ directory structure
  2. Create the rule loader (rules/index.js)
  3. Update normalize-title.js to use the new structure (still with all rules
    inline)
  4. Verify all tests pass

Phase 2: Extract Clean Categories (one at a time)

  1. Spelling Corrections - Pure data, easiest to extract
  2. Suffix Removal - Well-defined patterns
  3. Mystery Screenings - Small, self-contained
  4. Franchise Standardization - Grouped by franchise

For each extraction:

  • Move rules to new file
  • Add category-specific unit tests
  • Verify main tests still pass

Phase 3: Extract Complex Categories

  1. Separator Standardization - Large but straightforward
  2. Plus/Ampersand Normalization - Related to separators
  3. Title Expansions - Need careful testing
  4. Indian Cinema - May need transliteration expertise

Phase 4: Data-Driven Categories

  1. Cinema-Specific Quirks - Convert to JSON with metadata
  2. Time-Bound Corrections - Add expiry date infrastructure
  3. Add CI check for expired rules (optional cleanup reminder)
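The optional CI check in step 3 might look roughly like the following. It is a sketch only: findExpiredRules and formatReport are hypothetical names, and the wiring into a CI script is shown as a comment because it depends on how the repository runs its checks.

```javascript
// Hypothetical helper: list rules whose expiry date has passed.
function findExpiredRules(rules, now = new Date()) {
  return rules.filter((rule) => new Date(rule.expires) <= now);
}

// One line per expired rule, suitable for a CI log.
function formatReport(expired) {
  return expired
    .map((rule) => `expired ${rule.expires}: "${rule.pattern}"`)
    .join("\n");
}

// Assumed wiring in a CI script:
//   const { rules } = require("./time-bound-corrections.json");
//   const expired = findExpiredRules(rules);
//   if (expired.length > 0) {
//     console.error(formatReport(expired));
//     process.exitCode = 1; // or only warn, if this is just a reminder
//   }

const sampleRules = [
  { pattern: "CBeebies Panto 2025", replacement: "CBeebies Panto", expires: "2026-03-01" },
  { pattern: " + 28YL: The Bone Temple", replacement: " ", expires: "2025-12-31" },
];
console.log(formatReport(findExpiredRules(sampleRules, new Date("2026-01-15"))));
```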

Phase 5: Refactor Prefix Stripping

The hasPresents, hasScreenings, etc. logic (lines 570-675) should be
extracted to its own module with a cleaner pattern-based approach.
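A pattern-based version could be as simple as an ordered list of regexes. The sketch below is an assumption about what such a module might look like: the two patterns and the stripPrefixes name are invented for illustration and do not reproduce the existing hasPresents/hasScreenings behaviour.

```javascript
// Hypothetical prefix patterns: remove everything up to and including a
// marker such as "presents:" or "screening:".
const prefixPatterns = [
  /^.*?\bpresents?:\s*/i,
  /^.*?\bscreening:\s*/i,
];

// Apply each prefix pattern once, in order.
function stripPrefixes(title) {
  return prefixPatterns.reduce(
    (result, pattern) => result.replace(pattern, ""),
    title,
  );
}

console.log(stripPrefixes("Film Club presents: Alien")); // "Alien"
console.log(stripPrefixes("Family screening: Paddington")); // "Paddington"
```

Patterns this broad would also trim a film whose own title contains one of the markers, so any real implementation would need to be checked against the existing regression titles.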


Testing Strategy

Preserve Existing Tests

The existing normalize-title.test.js with test-titles.json should pass
throughout the refactoring. These are regression tests.
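Alongside the existing suite, the old and new implementations can be compared directly during the migration. This sketch assumes both versions are available as functions and that a plain list of input titles can be derived from test-titles.json, whose exact format is not shown here. findRegressions is a hypothetical name.

```javascript
// Hypothetical helper: report every title where the refactored normalizer
// disagrees with the legacy one.
function findRegressions(titles, legacyNormalize, refactoredNormalize) {
  const mismatches = [];
  for (const title of titles) {
    const expected = legacyNormalize(title);
    const actual = refactoredNormalize(title);
    if (expected !== actual) {
      mismatches.push({ title, expected, actual });
    }
  }
  return mismatches;
}

// Stand-in implementations, used only to demonstrate the comparison.
const legacy = (title) => title.toLowerCase().trim();
const refactored = (title) => title.toLowerCase();

console.log(findRegressions(["Alien", " Aliens "], legacy, refactored));
```

An empty result after each extraction step gives some confidence that moving rules between files has not changed behaviour.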

Add Category-Specific Tests

// rules/spelling-corrections.test.js
const spellingCorrections = require("./spelling-corrections");
const normalizeTitle = require("../index");

// A regex rule has no single canonical input (its .source is not a title),
// so only string rules are exercised here; regex rules need sample titles.
const stringRules = spellingCorrections.filter(
  ([pattern]) => typeof pattern === "string",
);

describe("Spelling Corrections", () => {
  test.each(stringRules)('corrects "%s" to "%s"', (input, expected) => {
    expect(normalizeTitle(input)).toContain(expected.toLowerCase());
  });
});

Add Expiry Tests

// rules/time-bound-corrections.test.js
const timeBound = require("./time-bound-corrections.json");

describe("Time-Bound Corrections", () => {
  it("all rules have valid expiry dates", () => {
    for (const rule of timeBound.rules) {
      // new Date() always returns a Date, so check the timestamp itself
      expect(Number.isNaN(new Date(rule.expires).getTime())).toBe(false);
    }
  });

  it("warns about rules expiring soon", () => {
    const thirtyDaysFromNow = new Date();
    thirtyDaysFromNow.setDate(thirtyDaysFromNow.getDate() + 30);

    // Note: this also matches rules that have already expired
    const expiringSoon = timeBound.rules.filter(
      (rule) => new Date(rule.expires) < thirtyDaysFromNow,
    );

    if (expiringSoon.length > 0) {
      console.warn("Rules expired or expiring within 30 days:", expiringSoon);
    }
  });
});

Estimated Effort

Phase                        Effort  Risk
Phase 1: Infrastructure      Low     Low
Phase 2: Clean Categories    Medium  Low
Phase 3: Complex Categories  Medium  Medium
Phase 4: Data-Driven         Medium  Low
Phase 5: Prefix Stripping    Medium  Medium

Total estimated effort: Medium

Recommended approach: Do one phase per sprint/cycle, validating thoroughly
before moving to the next.
