@alimaazamat alimaazamat commented Jan 13, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

A customer had a script to clean up "zombie nodes": non-functional Azure infrastructure/Kubernetes nodes that persist in a Kubernetes cluster (and in the component state machines of cloud-provider-azure and CA). The script looks for well-known "bad terminal states" of VMSS VMs and deletes those VMs. This PR implements that logic in CA so that it runs from a point of authority.

azure_zombie_cleanup.go is the cleanup implementation:

  1. Check if enabled - Returns early if EnableZombieCleanup is false
  2. Build K8s node lookup - Creates a map of normalized provider IDs to nodes for correlation
  3. Scan all VMSS - Lists all scale sets and their VMs with instance views
  4. Detect zombies - Calls evaluateZombieStatus() for each VM
  5. Split by registration status:
    • Unregistered zombies (no K8s node): Safe to delete directly
    • Registered zombies (has K8s node): Only logged; the autoscaler handles their removal
  6. Batch delete - Groups unregistered zombies by VMSS and calls DeleteInstancesAsync()
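
Steps 5 and 6 above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `zombie` type and `groupUnregisteredByScaleSet` are hypothetical names, and the actual call to DeleteInstancesAsync() is left out.

```go
package main

import "fmt"

// zombie is a hypothetical record for a VM that was already judged to be a zombie.
type zombie struct {
	ScaleSet   string // VMSS name
	InstanceID string // VMSS instance ID
	HasK8sNode bool   // true if a Kubernetes node matched the VM's provider ID
}

// groupUnregisteredByScaleSet splits zombies by registration status.
// Unregistered zombies are grouped into per-VMSS batches of instance IDs,
// so each scale set needs only one delete call. Registered zombies are
// returned separately so the caller can log them and leave them alone.
func groupUnregisteredByScaleSet(zs []zombie) (map[string][]string, []zombie) {
	batches := make(map[string][]string)
	var registered []zombie
	for _, z := range zs {
		if z.HasK8sNode {
			registered = append(registered, z)
			continue
		}
		batches[z.ScaleSet] = append(batches[z.ScaleSet], z.InstanceID)
	}
	return batches, registered
}

func main() {
	zs := []zombie{
		{ScaleSet: "pool-a", InstanceID: "0"},
		{ScaleSet: "pool-a", InstanceID: "3"},
		{ScaleSet: "pool-b", InstanceID: "1", HasK8sNode: true},
	}
	batches, registered := groupUnregisteredByScaleSet(zs)
	for vmss, ids := range batches {
		fmt.Printf("would issue one delete for %s with instances %v\n", vmss, ids)
	}
	for _, z := range registered {
		fmt.Printf("registered zombie %s/%s: log only\n", z.ScaleSet, z.InstanceID)
	}
}
```

Grouping first means the number of delete calls scales with the number of scale sets rather than the number of zombie VMs.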

Key notes:

  • VMs that never registered have no K8s state, so it's safe to delete them immediately
  • VMs with K8s nodes (unreachable/NotReady) are handed off to the autoscaler, which handles proper state deletion
  • Age threshold (default 5 min) prevents deleting recently created VMs
  • The feature is gated by the EnableZombieCleanup config flag
    • Dry-run mode (ZombieCleanupDryRun) logs what would be deleted but takes no action
  • Batch deletion reduces API calls and improves efficiency

Functions Implemented:

  • cleanupZombieNodes(): Main entry point
  • cleanupZombieNodesWithContext(nodes): Accepts K8s nodes for correlation
  • evaluateZombieStatus(vm, k8sNodeMap, time, minAge): Returns (isZombie, hasK8sNode, reason)
  • normalizeProviderID(providerID): Matches Azure IDs to K8s provider IDs

The implementation is called from forceRefresh() in azure_manager.go, which runs once per VmssCacheTTLInSeconds interval (default: 1 min).

Tests:

  • TestZombieCleanup_Disabled - Verifies cleanup doesn't run when feature is disabled
  • TestZombieCleanup_NoZombiesFound - Verifies behavior when no zombies exist
  • TestZombieCleanup_DetectsFailedProvisioning - Verifies Scenario 2: Provisioning failed
  • TestZombieCleanup_DetectsFailedExtensions - Verifies Scenario 1: Extensions failed
  • TestZombieCleanup_DetectsNeverRegisteredInstances - Verifies Scenario 3: Never registered
  • TestZombieCleanup_WithK8sNodesContext - Verifies registered zombies are NOT deleted (Scenario 4a: Unreachable taint)
  • TestZombieCleanup_RespectsMinAge - Verifies age threshold is respected
  • TestZombieCleanup_DryRunMode - Verifies dry-run doesn't delete anything
  • TestZombieCleanup_MultipleZombiesInSamePool - Verifies batch deletion
  • TestZombieCleanup_MultipleVMSSPools - Verifies cleanup across multiple pools
  • TestZombieCleanup_MixedZombiesAndHealthy - Verifies only zombies are deleted
  • TestZombieCleanup_IgnoresDeallocatedNodes - Verifies deallocated VMs are not deleted (Scenario 4b: NotReady but deallocated)

Scenario Detection Tests:

  • TestZombieScenario_ExtensionsFailedToInstall - Demonstrates Scenario 1a: Extensions failed
  • TestZombieScenario_ExtensionsNeverInstalled - Demonstrates Scenario 1b: Extensions never installed (flapping zombie)
  • TestZombieScenario_ProvisioningFailed - Demonstrates Scenario 2: Provisioning failed
  • TestZombieScenario_NeverRegisteredInKubernetes - Demonstrates Scenario 3: Never registered (AllocationFailed)
  • TestZombieScenario_NodeUnreachableTaint - Demonstrates Scenario 4a: Node has unreachable taint
  • TestZombieScenario_NodeNotReady - Demonstrates Scenario 4b: Node NotReady with running VM
  • TestZombieScenario_DeallocatedNodesAreHealthy - Demonstrates healthy deallocated nodes are NOT zombies
  • TestZombieScenario_MultipleZombiesWasteQuota - Demonstrates severe quota waste scenario

Helper Functions:

  • setupMockManager - Setup mock manager with Azure clients
  • newTestAzureManagerForZombieCleanup - Setup a test Azure manager with default config
  • newHealthyVM - Creates a healthy VM for testing
  • newZombieVMWithFailedProvisioning - Creates a VM with failed provisioning state
  • newZombieVMWithFailedExtensions - Creates a VM with failed extensions
  • newZombieVMNeverRegistered - Creates a VM that never registered with K8s
  • newUnreachableZombieVM - Creates a VM that is running but will have unreachable taint
  • newRecentVM - Creates a recently created VM (below age threshold)
  • newDeallocatedVM - Creates a deallocated VM (from autoscaler scale-down)

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-area labels Jan 13, 2026
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider labels Jan 13, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alimaazamat
Once this PR has been reviewed and has the lgtm label, please assign nilo19 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-area labels Jan 13, 2026
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 14, 2026
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 14, 2026
@alimaazamat alimaazamat changed the title [WIP] Zombie Node Cleanup Zombie Node Cleanup Jan 20, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2026
