ops/pulumi: stage activation docs, safe networking, ALB health check fix #368

Open
e9e4e5f0faef wants to merge 3 commits into stage from feat/ops-monitoring-runbook

@e9e4e5f0faef
Collaborator

Summary

  • Add ALB health check middleware to resolve Django DisallowedHost failures on ECS/Fargate
  • Fix Fargate networking: enable assign_public_ip for external services (ECR image pull fix)
  • Add missing NETAPP_STORAGE_ROOT environment variable to all container definitions
  • Update Pulumi README with correct resource names, safety layers, and architecture

Changes

| Category | File | Purpose |
|---|---|---|
| Middleware | src/olympia/amo/middleware_healthcheck.py | Intercepts ALB health checks before Django host validation |
| Settings | settings_local_stage.py | Prepend health check middleware to MIDDLEWARE |
| Infra | infra/pulumi/__main__.py | Force assign_public_ip=True for external Fargate services |
| Config | infra/pulumi/config.stage.yaml | Add NETAPP_STORAGE_ROOT to web, worker, versioncheck |
| Docs | infra/pulumi/README.md | Updated architecture, safety layers, correct service names |
| Docs | infra/DEPLOYMENT_VALIDATION.md | Deployment and RO healthcheck validation results |
| Config | .gitignore | Exclude local ops artefacts (infra/adhoc/, infra/docs/) |

Why

ALB health checks reach containers with Host: <load-balancer-node-IP>, which Django
rejects via ALLOWED_HOSTS. Adding .compute.internal does not help because the value
is an IP, not a hostname, and Django does not support CIDR in ALLOWED_HOSTS. We avoid
ALLOWED_HOSTS = ['*'] (over-broad) and avoid ECS metadata calls during settings import.
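
To illustrate why neither a `.compute.internal` suffix nor a CIDR entry matches an IP-valued Host header, here is a minimal sketch of the matching rule. It is a simplified re-implementation written for this description, not Django's own code (port stripping is omitted), and the host names and addresses are made-up examples.

```python
def is_same_domain(host: str, pattern: str) -> bool:
    """Simplified version of the ALLOWED_HOSTS pattern rule.

    A pattern starting with "." matches the bare domain and any subdomain;
    any other pattern must equal the host exactly. There is no CIDR handling.
    """
    if not pattern:
        return False
    pattern = pattern.lower()
    if pattern.startswith("."):
        return host.endswith(pattern) or host == pattern[1:]
    return pattern == host


def host_allowed(host: str, allowed_hosts: list[str]) -> bool:
    """Return True if host matches "*" or any pattern in allowed_hosts."""
    return any(p == "*" or is_same_domain(host, p) for p in allowed_hosts)


# Hypothetical values, for illustration only.
allowed = [".compute.internal", "stage.example.test"]

# A hostname under the suffix matches the leading-dot pattern.
hostname_ok = host_allowed("ip-10-0-1-23.ec2.compute.internal", allowed)

# A bare IP does not end with ".compute.internal", so it is rejected.
ip_ok = host_allowed("10.0.1.23", allowed)

# A CIDR string is compared literally, so it never matches an address.
cidr_ok = host_allowed("10.0.1.23", ["10.0.0.0/16"])

print(hostname_ok, ip_ok, cidr_ok)
```

Only the wildcard entry `"*"` would accept every IP here, which is the over-broad option this PR avoids.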

Fargate tasks for external services run in public subnets, where they also need
assign_public_ip enabled to pull images from ECR (tb_pulumi uses a single subnet list for
ALB and tasks; documented as a TODO for upstream).
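
As a rough sketch of the runtime override, assuming the service configuration is available as a plain dictionary keyed by service name: the dictionary shape, the `force_public_ip` helper, and the sample values below are hypothetical, and the real tb_pulumi structures may look different.

```python
import copy


def force_public_ip(services: dict) -> dict:
    """Return a copy of services with assign_public_ip=True on external ones.

    The input is left untouched so the loaded YAML config stays as written.
    """
    patched = copy.deepcopy(services)
    for service in patched.values():
        # Treat a missing `internal` key as internal, so a service is never
        # given a public IP by accident.
        if not service.get("internal", True):
            service["assign_public_ip"] = True
    return patched


# Hypothetical config shape, for illustration only.
stage_services = {
    "web": {"internal": False, "desired_count": 0},
    "worker": {"internal": True, "desired_count": 0},
}

patched = force_public_ip(stage_services)
print(patched["web"])
```

Working on a copy keeps the override visible in one place in code while the values in config.stage.yaml remain unchanged.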

Approach

  • A small middleware intercepts the health check request (GET /services/monitor.json) and
    returns 200 before Django's host validation runs. All other paths keep strict ALLOWED_HOSTS.
  • assign_public_ip is forced in __main__.py for services with internal: false, so
    config.stage.yaml stays clean (the code overrides the value at runtime).
  • NETAPP_STORAGE_ROOT=/tmp/storage satisfies the required Django setting using ephemeral
    container storage (no EFS mount needed in RO mode).
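
The middleware idea in the first bullet can be sketched as follows. This is an illustration under assumptions, not the contents of middleware_healthcheck.py: the class name, the helper function, and the response body are hypothetical. Host validation happens when `request.get_host()` is called, so a middleware that only reads the method and path can answer before that point.

```python
HEALTH_CHECK_PATH = "/services/monitor.json"


def is_health_check(method: str, path: str) -> bool:
    """Return True only for a GET to the exact health check path."""
    return method == "GET" and path == HEALTH_CHECK_PATH


class ALBHealthCheckMiddleware:  # hypothetical class name
    """Short-circuit ALB health checks; pass everything else downstream."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # request.method and request.path do not call request.get_host(),
        # so reading them does not trigger the ALLOWED_HOSTS check.
        if is_health_check(request.method, request.path):
            # Imported here so the sketch can be exercised without Django
            # installed; a real module would import at the top.
            from django.http import HttpResponse

            # Assumed body; HttpResponse defaults to status 200.
            return HttpResponse("OK", content_type="text/plain")
        # Every other request continues through the normal chain and is
        # still subject to strict ALLOWED_HOSTS validation.
        return self.get_response(request)
```

For the short-circuit to work, the entry has to sit first in MIDDLEWARE, ahead of any middleware that calls `request.get_host()`.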

Validation

Tested end-to-end on versioncheck via ad hoc image build:

  1. Built and pushed a temporary image tag (test-healthcheck-fix) to ECR
  2. Registered a new task definition revision pointing to the test image
  3. Updated versioncheck service to the test revision with desiredCount=1
  4. Confirmed ALB target became healthy and /services/monitor.json returned HTTP 200
  5. Confirmed zero DisallowedHost errors in CloudWatch logs during steady state
  6. Reverted to the previous task definition (revision 6) and scaled to 0 after validation
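
Steps 4 and 5 can be approximated with a couple of small helpers. This is a sketch only: the function names are hypothetical, the sample log lines are illustrative rather than copied from CloudWatch, and nothing here was part of the actual validation run.

```python
import urllib.error
import urllib.request


def probe_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET to url, including 4xx/5xx."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for error statuses; the code is still what we want.
        return error.code


def count_disallowed_host(log_lines) -> int:
    """Count log lines that mention Django's DisallowedHost exception."""
    return sum(1 for line in log_lines if "DisallowedHost" in line)


# Illustrative log lines, not real output from the stage environment.
sample_logs = [
    "GET /services/monitor.json 200",
    "django.core.exceptions.DisallowedHost: Invalid HTTP_HOST header: '10.0.1.23'.",
    "GET /services/monitor.json 200",
]

disallowed = count_disallowed_host(sample_logs)
print(disallowed)
```

A steady-state check would expect `probe_status` to return 200 for the monitor URL and the DisallowedHost count over the observation window to be zero.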

Earlier in the session: full pulumi up (157 resources), RO MySQL user creation,
RO healthcheck (5/5 passed), and CI build-and-push confirmed working via manual dispatch.

Safety

| Layer | State |
|---|---|
| All ECS services | desired_count: 0 |
| Autoscaling | Suspended, min_capacity: 0 |
| EventBridge schedules | All 16 DISABLED |
| DB credentials | BOOTSTRAP_SAFE=true (RO user) |
| Test image | Rolled back; no test tags running in stage |

Follow-ups

  • After merge, CI builds a new stage-latest image containing the middleware fix
  • Re-validate versioncheck with CI-built image, then proceed to web
  • Workers require EC2 worker shutdown coordination (separate step)

e9e4e5f0faef requested a review from Sancus on March 7, 2026 20:58
e9e4e5f0faef self-assigned this on Mar 7, 2026