Create structured approach to title normalization #209

Plan generated by LLM -- a good starting point, though perhaps a little overengineered. It's done a good job of categorising the different corrections required, though an initial pass could be to have it check whether there are any common patterns in the known removable phrases.

---

# Refactor Plan: normalize-title.js into Categorized Rules

## Overview

The current `normalize-title.js` contains ~527 correction rules in a single
array, mixing different types of transformations. This plan outlines how to
refactor it into categorized, maintainable rule sets.

---

## Current State Analysis

The `corrections` array (lines 21-548) contains rules that fall into these
distinct categories:

### 1. Spelling Corrections (~40 rules)

Typos in cinema listings that need fixing to match TMDB:

```javascript
["Wildnerness", "Wilderness"],
["Carvaggio", "Caravaggio"],
["Labryinth", "Labyrinth"],
["Downtown Abbey", "Downton Abbey"],
["Prime Minster", "Prime Minister"],
```

### 2. Separator Standardization (~80 rules)

Converting `-` to `:` or standardizing formatting:

```javascript
["Average Rob -", "Average Rob:"],
["CBeebies - ", "CBeebies: "],
["Film Club -", "Film Club: "],
["Migrant Cinema - ", "Migrant Cinema: "],
```
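
Several of these rules share one shape: a known prefix followed by a dash. As a rough sketch of how that common pattern might be captured, the snippet below generates the rules from a list of prefixes. The names `dashPrefixes`, `escapeRegExp` and `separatorRules` are hypothetical, and the generated rules always emit `": "`, which differs slightly from a rule such as `["Average Rob -", "Average Rob:"]` (no trailing space).

```javascript
// Hypothetical: known series prefixes that cinemas tend to follow with " - ".
const dashPrefixes = ["Average Rob", "CBeebies", "Film Club", "Migrant Cinema"];

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// One generated [pattern, replacement] rule per prefix, tolerant of spacing
// around the dash.
const separatorRules = dashPrefixes.map((prefix) => [
  new RegExp(`${escapeRegExp(prefix)}\\s*-\\s*`, "i"),
  `${prefix}: `,
]);

// Apply every generated rule once, in order.
function standardizeSeparators(title) {
  return separatorRules.reduce(
    (result, [pattern, replacement]) => result.replace(pattern, replacement),
    title,
  );
}

console.log(standardizeSeparators("CBeebies - Bedtime Stories"));
// "CBeebies: Bedtime Stories"
```

Whether this is worth doing depends on how many of the ~80 rules really follow the same shape; rules with their own quirks would still need explicit entries.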

### 3. Plus/Ampersand Normalization (~30 rules)

Standardizing `+` to `&` (or `and`) for double bills:

```javascript
[" + Cat", " and Cat"],
["Tiddler + ", "Tiddler & "],
["friends + crew", "friends & crew"],
```

### 4. Title Expansions (~25 rules)

Short titles that need expanding to match TMDB's full title:

```javascript
[/^Mishima$/i, "Mishima: A Life in Four Chapters"],
[/^Dr\.? Strangelove$/i, "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"],
["Mulholland Dr.", "Mulholland Drive"],
```

### 5. Franchise Standardization (~40 rules)

Consistent naming for film series:

```javascript
["Mission: Impossible - ", "Mission: Impossible – "],
["Mission: Impossible 2", "Mission: Impossible II"],
[/Lord of the Rings -/i, "Lord of the Rings: "],
["Frozen 2", "Frozen II"],
```

### 6. Indian Cinema Transliteration (~20 rules)

Variant spellings of Indian film titles:

```javascript
["Vasthunnam", "Vasthunam"],
["Sardaar Ji", "Sardaarji"],
["Bāhubali", "Baahubali"],
["Shanthamee Reethriyil", "Shanthamee Raathriyil"],
```

### 7. Mystery/Secret Screenings (~8 rules)

Collapsing various "mystery movie" naming patterns:

```javascript
[/(free |monthly )?mystery ([\w+]+ )?([\w+]+ )?(night|film|movie|cinema|screening):?/i, "mystery movie"],
[/(classic )?secret scre(e|a)(n|m)ing( \d+)?/i, "mystery movie"],
```
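
To make the effect of these two patterns concrete, the snippet below runs them over a few listing titles. The sample titles are made up for illustration and the helper name `collapseMystery` is hypothetical; only the two regular expressions come from the existing rules.

```javascript
// The two patterns quoted above, applied in order to a lowercased title.
const mysteryRules = [
  [/(free |monthly )?mystery ([\w+]+ )?([\w+]+ )?(night|film|movie|cinema|screening):?/i, "mystery movie"],
  [/(classic )?secret scre(e|a)(n|m)ing( \d+)?/i, "mystery movie"],
];

function collapseMystery(title) {
  let result = title.toLowerCase();
  for (const [pattern, replacement] of mysteryRules) {
    result = result.replace(pattern, replacement);
  }
  return result;
}

// Hypothetical sample titles.
console.log(collapseMystery("Monthly Mystery Movie Night")); // "mystery movie"
console.log(collapseMystery("Classic Secret Screening 12")); // "mystery movie"
console.log(collapseMystery("Secret Screaming"));            // "mystery movie" (typo tolerated)
```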

### 8. Cinema-Specific Quirks (~50 rules)

Corrections for how specific cinemas name things:

```javascript
["Battleground + intro ", "Battlefield + intro "], // BFI gets the name wrong
["ODEON Pride Nights - ", "ODEON Pride Nights "],
```

### 9. Suffix Removal (~30 rules)

Removing trailing noise:

```javascript
[/\s+dub?$/i, ""],  // Dubbed
[/\s+sub?$/i, ""],  // Subbed
[/\s+(3|2)d$/i, ""], // 3D/2D
[/\s+extended$/i, ""],
```

### 10. Time-Bound Corrections (~200+ rules)

Specific corrections that only make sense for a limited time:

```javascript
["CBeebies Panto 2025", "CBeebies Panto"],
["Disney Junior Cinema Club 2025", "Disney Junior Cinema Club"],
[" + 28YL: The Bone Temple", " "], // Specific to 2025 releases
```

---

## Proposed Directory Structure

```
common/
  normalize-title/
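    index.js                        # Main entry point
    prefix-strippers.js             # "presents:" / "screening:" logic (Phase 5)
    character-normalization.js      # Diacritics, punctuation, etc.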
    rules/
      index.js                      # Combines all rules
      spelling-corrections.js       # Typos
      separator-standardization.js  # Dash to colon conversions
      plus-ampersand-normalization.js
      title-expansions.js           # Short → full titles
      franchise-standardization.js  # Series naming consistency
      indian-cinema.js              # Transliteration variants
      mystery-screenings.js         # Secret/mystery movie patterns
      suffix-removal.js             # Trailing noise removal
      cinema-specific.json          # Data file, not code
      time-bound-corrections.json   # With expiry dates
```

---

## Implementation Details

### JSON Data Files for Static Rules

For cinema-specific quirks that are purely data, `rules/cinema-specific.json`
could hold each rule alongside its provenance:

```json
{
  "rules": [
    {
      "pattern": "Battleground + intro ",
      "replacement": "Battlefield + intro ",
      "source": "bfi.org.uk",
      "note": "BFI incorrectly names this film"
    },
    {
      "pattern": "ODEON Pride Nights - ",
      "replacement": "ODEON Pride Nights ",
      "source": "odeon.co.uk"
    }
  ]
}
```
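
One limitation of this format is that JSON cannot hold a `RegExp` literal, so every `pattern` is a plain string. The sketch below shows one possible way around that, assuming two extra optional fields, `regex` and `flags`, that are not part of the plan above. The second sample rule and the `toTuple` helper are likewise hypothetical.

```javascript
// In-memory stand-in for cinema-specific.json.
// ASSUMPTION: the "regex" and "flags" fields are invented for this sketch.
const cinemaSpecific = {
  rules: [
    { pattern: "ODEON Pride Nights - ", replacement: "ODEON Pride Nights ", source: "odeon.co.uk" },
    { pattern: "\\s+\\(example cinema\\)$", regex: true, flags: "i", replacement: "", source: "example.com" },
  ],
};

// Convert a JSON rule object into the [pattern, replacement] tuple used by
// the other rule files, compiling the pattern when it is marked as a regex.
function toTuple(rule) {
  const pattern = rule.regex ? new RegExp(rule.pattern, rule.flags || "") : rule.pattern;
  return [pattern, rule.replacement];
}

const cinemaRules = cinemaSpecific.rules.map(toTuple);

console.log("Alien (Example Cinema)".replace(cinemaRules[1][0], cinemaRules[1][1]));
// "Alien"
```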

### Time-Bound Rules with Expiry Dates

Each rule in `rules/time-bound-corrections.json` carries an `expires` date:

```json
{
  "rules": [
    {
      "pattern": "CBeebies Panto 2025",
      "replacement": "CBeebies Panto",
      "expires": "2026-03-01",
      "note": "Remove after panto season ends"
    },
    {
      "pattern": " + 28YL: The Bone Temple",
      "replacement": " ",
      "expires": "2025-12-31",
      "note": "Double bill promo for 28 Years Later"
    }
  ]
}
```

### Category-Specific Rule Files

```javascript
// rules/spelling-corrections.js
module.exports = [
  // General English typos
  ["Wildnerness", "Wilderness"],
  ["Labryinth", "Labyrinth"],
  ["behaviour", "behavior"],
  ["colourful", "colorful"],

  // Film-specific misspellings
  ["Carvaggio", "Caravaggio"],
  ["Downtown Abbey", "Downton Abbey"],
  ["The God Father", "The Godfather"],
];
```

```javascript
// rules/franchise-standardization.js
module.exports = [
  // Mission: Impossible
  ["Mission: Impossible - ", "Mission: Impossible – "],
  ["Mission: Impossible 2", "Mission: Impossible II"],
  [/M:I Season: (?!Mission)/i, "M:I Season: Mission: Impossible – "],

  // Lord of the Rings
  [/(?:\s|^)LOTR(?:\s|:)(?:\s*the\s+)?/i, "The Lord of the Rings: The "],
  [/Lord of the Rings -/i, "Lord of the Rings: "],

  // Frozen
  ["Frozen 2", "Frozen II"],
];
```

### Rule Loader with Expiry Filtering

```javascript
// rules/index.js
const spellingCorrections = require("./spelling-corrections");
const separatorStandardization = require("./separator-standardization");
const franchiseStandardization = require("./franchise-standardization");
// ... etc

// Load JSON data files (require() parses .json files directly)
const cinemaSpecific = require("./cinema-specific.json");
const timeBound = require("./time-bound-corrections.json");

// Filter out expired time-bound rules. This runs once, when the module is
// first loaded, so a long-running process keeps a rule until it restarts.
const activeTimeBound = timeBound.rules
  .filter((rule) => new Date(rule.expires) > new Date())
  .map((rule) => [rule.pattern, rule.replacement]);

// Convert JSON rules to array format
const cinemaRules = cinemaSpecific.rules.map((rule) => [
  rule.pattern,
  rule.replacement,
]);

// Rules are applied in array order, so the order of the spreads matters
module.exports = [
  ...spellingCorrections,
  ...separatorStandardization,
  ...franchiseStandardization,
  ...cinemaRules,
  ...activeTimeBound,
  // ... etc
];
```
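
The expiry filter above compares against the real clock, which makes it awkward to test. A minimal sketch of an alternative, assuming a hypothetical helper named `activeRules` that takes the current time as a parameter:

```javascript
// Hypothetical helper: keep only rules whose expiry is still in the future.
// `now` is injectable so tests can pin the date. In this sketch, rules
// without an `expires` field never expire.
function activeRules(rules, now = new Date()) {
  return rules
    .filter((rule) => {
      if (!rule.expires) return true;
      const expiry = new Date(rule.expires);
      if (Number.isNaN(expiry.getTime())) {
        throw new Error(`Invalid expiry date "${rule.expires}" for pattern "${rule.pattern}"`);
      }
      return expiry > now;
    })
    .map((rule) => [rule.pattern, rule.replacement]);
}

// The two sample rules from time-bound-corrections.json.
const sample = [
  { pattern: "CBeebies Panto 2025", replacement: "CBeebies Panto", expires: "2026-03-01" },
  { pattern: " + 28YL: The Bone Temple", replacement: " ", expires: "2025-12-31" },
];

console.log(activeRules(sample, new Date("2026-01-15T00:00:00Z")));
// [ [ 'CBeebies Panto 2025', 'CBeebies Panto' ] ]
```

A date-only string such as `"2026-03-01"` is parsed as midnight UTC, so a rule stops applying at the start of its expiry day rather than the end.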

### Simplified Main Function

```javascript
// normalize-title/index.js
const rules = require("./rules");
const prefixStrippers = require("./prefix-strippers");
const characterNormalization = require("./character-normalization");

// Note: `options` is accepted but not used in this sketch.
function normalizeTitle(title, options) {
  title = title.replace(/\s+/g, " ").toLowerCase();

  // Apply categorized rules in order. String patterns are lowercased to match
  // the lowercased title; RegExp patterns are used as-is.
  for (const [pattern, replacement] of rules) {
    const matcher = typeof pattern === "string" ? pattern.toLowerCase() : pattern;
    title = title.replace(matcher, replacement.toLowerCase());
  }

  // Apply prefix stripping (the "presents:", "screening:" logic)
  title = prefixStrippers.apply(title);

  // Apply character normalization (diacritics, punctuation, etc.)
  title = characterNormalization.apply(title);

  return title.trim();
}

module.exports = normalizeTitle;
```
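
To show how the rule loop behaves, here is a self-contained version using a three-rule set borrowed from the categories above. The helper name `applyRules` is hypothetical, and prefix stripping and character normalization are left out.

```javascript
// Minimal stand-in for the rule loop, with a small sample rule set.
const rules = [
  ["Frozen 2", "Frozen II"],
  ["CBeebies - ", "CBeebies: "],
  [/\s+(3|2)d$/i, ""],
];

function applyRules(title, ruleSet) {
  let result = title.replace(/\s+/g, " ").toLowerCase();
  for (const [pattern, replacement] of ruleSet) {
    const matcher = typeof pattern === "string" ? pattern.toLowerCase() : pattern;
    result = result.replace(matcher, replacement.toLowerCase());
  }
  return result.trim();
}

console.log(applyRules("Frozen 2   3D", rules));       // "frozen ii"
console.log(applyRules("Frozen 2 + Frozen 2", rules)); // "frozen ii + frozen 2"
```

The second call shows that a string pattern replaces only its first occurrence; a rule that must apply everywhere in a title needs a RegExp with the `g` flag.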

---

## Benefits

| Benefit              | Description                                                                  |
| -------------------- | ---------------------------------------------------------------------------- |
| **Maintainability**  | Each category is in its own file, easier to find and update                  |
| **Self-Documenting** | JSON files can include `source` and `note` fields explaining why rules exist |
| **Auto-Expiry**      | Time-bound rules automatically stop applying after their expiry date         |
| **Testability**      | Each category can have its own unit tests                                    |
| **Discoverability**  | When a cinema reports an issue, filter `cinema-specific.json` by `source`    |
| **Metrics**          | You can count rules per category to see where complexity lies                |

---

## Migration Strategy

### Phase 1: Infrastructure

1. Create the `normalize-title/` directory structure
2. Create the rule loader (`rules/index.js`)
3. Update `normalize-title.js` to read its rules through the loader (at this
   stage all rules still live together, unchanged)
4. Verify all tests pass

### Phase 2: Extract Clean Categories (one at a time)

1. **Spelling Corrections** - Pure data, easiest to extract
2. **Suffix Removal** - Well-defined patterns
3. **Mystery Screenings** - Small, self-contained
4. **Franchise Standardization** - Grouped by franchise

For each extraction:

- Move rules to new file
- Add category-specific unit tests
- Verify main tests still pass
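
When moving rules between files it is easy to drop one silently. As a rough sketch of a guard against that, the hypothetical helpers below (`ruleKey`, `missingRules`) compare the legacy array with the combined refactored array by value, ignoring order:

```javascript
// Describe a rule as a string so that string and RegExp patterns can be
// compared by value rather than by object identity.
function ruleKey([pattern, replacement]) {
  const described =
    pattern instanceof RegExp ? `re:${pattern.source}/${pattern.flags}` : `str:${pattern}`;
  return `${described} => ${replacement}`;
}

// List every legacy rule that has no equivalent in the refactored set.
function missingRules(legacyRules, refactoredRules) {
  const present = new Set(refactoredRules.map(ruleKey));
  return legacyRules.map(ruleKey).filter((key) => !present.has(key));
}

// Sample data: the refactored set has lost the "Carvaggio" rule.
const legacy = [["Frozen 2", "Frozen II"], [/\s+extended$/i, ""], ["Carvaggio", "Caravaggio"]];
const refactored = [[/\s+extended$/i, ""], ["Frozen 2", "Frozen II"]];

console.log(missingRules(legacy, refactored));
// [ 'str:Carvaggio => Caravaggio' ]
```

This only checks that no rule was lost. Because rules are applied in sequence, a changed order can still alter results, which the regression tests would need to catch.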

### Phase 3: Extract Complex Categories

1. **Separator Standardization** - Large but straightforward
2. **Plus/Ampersand Normalization** - Related to separators
3. **Title Expansions** - Need careful testing
4. **Indian Cinema** - May need transliteration expertise

### Phase 4: Data-Driven Categories

1. **Cinema-Specific Quirks** - Convert to JSON with metadata
2. **Time-Bound Corrections** - Add expiry date infrastructure
3. Add CI check for expired rules (optional cleanup reminder)
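
The optional CI check in step 3 could be as small as the following sketch. The function name `findExpiredRules` is hypothetical, and the inline array stands in for the contents of `time-bound-corrections.json`.

```javascript
// Hypothetical CI helper: list rules whose expiry date has already passed.
function findExpiredRules(rules, now = new Date()) {
  return rules.filter((rule) => rule.expires && new Date(rule.expires) <= now);
}

// Stand-in for require("./time-bound-corrections.json").rules
const timeBoundRules = [
  { pattern: "CBeebies Panto 2025", replacement: "CBeebies Panto", expires: "2026-03-01" },
  { pattern: " + 28YL: The Bone Temple", replacement: " ", expires: "2025-12-31" },
];

// A fixed date keeps the example deterministic; CI would use the default.
const expired = findExpiredRules(timeBoundRules, new Date("2026-01-15T00:00:00Z"));
for (const rule of expired) {
  console.warn(`Expired on ${rule.expires}: "${rule.pattern}"`);
}

// In CI, a non-zero exit code would turn the reminder into a failing check:
// process.exitCode = expired.length > 0 ? 1 : 0;
```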

### Phase 5: Refactor Prefix Stripping

The `hasPresents`, `hasScreenings`, etc. logic (lines 570-675) should be
extracted to its own module with a cleaner pattern-based approach.

---

## Testing Strategy

### Preserve Existing Tests

The existing `normalize-title.test.js` with `test-titles.json` should pass
throughout the refactoring. These are regression tests.
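
Alongside the existing suite, a direct comparison of old and new output can help during the migration. The sketch below assumes a hypothetical harness, `findMismatches`, and uses two toy normalizers that apply the same two rules in a different order; "Frozen 2D" is a made-up title chosen to expose the difference.

```javascript
// Run the legacy and refactored normalizers over the same titles and collect
// every title where the outputs differ.
function findMismatches(titles, legacyNormalize, refactoredNormalize) {
  const mismatches = [];
  for (const title of titles) {
    const expected = legacyNormalize(title);
    const actual = refactoredNormalize(title);
    if (expected !== actual) mismatches.push({ title, expected, actual });
  }
  return mismatches;
}

// Toy stand-ins: same rules, different order.
const legacy = (t) =>
  t.toLowerCase().replace("frozen 2", "frozen ii").replace(/\s+(3|2)d$/i, "");
const reordered = (t) =>
  t.toLowerCase().replace(/\s+(3|2)d$/i, "").replace("frozen 2", "frozen ii");

console.log(findMismatches(["Frozen 2", "Frozen 2D"], legacy, reordered));
// [ { title: 'Frozen 2D', expected: 'frozen iid', actual: 'frozen' } ]
```

The mismatch shows why grouping rules by category is not automatically behavior-preserving: concatenating category files changes the order in which rules run.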

### Add Category-Specific Tests

```javascript
// rules/spelling-corrections.test.js
const spellingCorrections = require("./spelling-corrections");
const normalizeTitle = require("../index");

// Only string patterns can double as sample input. A RegExp's `.source` is
// pattern syntax, not a title, so RegExp rules need hand-written cases.
const stringRules = spellingCorrections.filter(
  ([pattern]) => typeof pattern === "string",
);

describe("Spelling Corrections", () => {
  test.each(stringRules)('corrects "%s" to "%s"', (input, expected) => {
    expect(normalizeTitle(input)).toContain(expected.toLowerCase());
  });
});
```

### Add Expiry Tests

```javascript
// rules/time-bound-corrections.test.js
const timeBound = require("./time-bound-corrections.json");

describe("Time-Bound Corrections", () => {
  it("all rules have valid expiry dates", () => {
    for (const rule of timeBound.rules) {
      // new Date() never throws on bad input; it yields a Date whose time is NaN.
      expect(Number.isNaN(new Date(rule.expires).getTime())).toBe(false);
    }
  });

  it("warns about rules expiring soon", () => {
    const thirtyDaysFromNow = new Date();
    thirtyDaysFromNow.setDate(thirtyDaysFromNow.getDate() + 30);

    const expiringSoon = timeBound.rules.filter(
      (rule) => new Date(rule.expires) < thirtyDaysFromNow,
    );

    if (expiringSoon.length > 0) {
      console.warn("Rules expiring within 30 days:", expiringSoon);
    }
  });
});
```

---

## Estimated Effort

| Phase                       | Effort | Risk   |
| --------------------------- | ------ | ------ |
| Phase 1: Infrastructure     | Low    | Low    |
| Phase 2: Clean Categories   | Medium | Low    |
| Phase 3: Complex Categories | Medium | Medium |
| Phase 4: Data-Driven        | Medium | Low    |
| Phase 5: Prefix Stripping   | Medium | Medium |

**Total estimated effort:** Medium

**Recommended approach:** Do one phase per sprint/cycle, validating thoroughly
before moving to the next.

