Zombie Node Cleanup #9052
base: master
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alimaazamat

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: Approvers can indicate their approval by writing
What type of PR is this?
/kind feature
What this PR does / why we need it:
A customer had a script to clean up "zombie nodes": non-functional Azure infrastructure/Kubernetes nodes that persist in a Kubernetes cluster (and in the component state machines of cloud-provider-azure and the Cluster Autoscaler, CA). The customer's old-fashioned script looks for well-known "bad terminal states" of VMSS VMs and deletes those VMs. This PR moves that logic into CA, so the cleanup runs from a single authoritative component instead of an external script.
`azure_zombie_cleanup.go` is the cleanup implementation.

Key notes:

Functions implemented:
- `cleanupZombieNodes()`: main entry point
- `cleanupZombieNodesWithContext(nodes)`: accepts K8s nodes for correlation
- `evaluateZombieStatus(vm, k8sNodeMap, time, minAge)`: returns `(isZombie, hasK8sNode, reason)`
- `normalizeProviderID(providerID)`: matches Azure IDs to K8s provider IDs

The implementation is called from `forceRefresh()` in `azure_manager.go`, which runs every `VmssCacheTTLInSeconds` interval (default: 1 min).

Tests:
Scenario Detection Tests:
Helper Functions:
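The correlation step (matching Azure VM IDs to Kubernetes node provider IDs, as `normalizeProviderID` is described as doing) could look roughly like this. It is a hypothetical sketch, not the PR's helper: it assumes the only mismatches are the `azure://` scheme prefix on node provider IDs and differences in letter case, which may not cover every case the PR handles. `normalizeProviderIDSketch` and `buildNodeIndex` are illustrative names.

```go
package main

import (
	"fmt"
	"strings"
)

// azurePrefix is the scheme Kubernetes node providerIDs carry on Azure,
// e.g. "azure:///subscriptions/...". ARM resource IDs lack it.
const azurePrefix = "azure://"

// normalizeProviderIDSketch (hypothetical) strips the scheme and lowercases
// the ID, so an ARM resource ID and a node providerID for the same VM match
// even when segments such as the resource group differ only in case.
func normalizeProviderIDSketch(id string) string {
	return strings.ToLower(strings.TrimPrefix(id, azurePrefix))
}

// buildNodeIndex (hypothetical) maps normalized providerIDs to node names,
// skipping nodes that have no providerID yet.
func buildNodeIndex(nodeProviderIDs map[string]string) map[string]string {
	idx := make(map[string]string, len(nodeProviderIDs))
	for name, pid := range nodeProviderIDs {
		if pid == "" {
			continue
		}
		idx[normalizeProviderIDSketch(pid)] = name
	}
	return idx
}

func main() {
	nodes := map[string]string{
		"aks-pool-vmss000000": "azure:///subscriptions/sub/resourceGroups/MC_rg/providers/Microsoft.Compute/virtualMachineScaleSets/aks-pool-vmss/virtualMachines/0",
	}
	idx := buildNodeIndex(nodes)

	// An ARM ID for the same VM, with different casing and no scheme.
	vmID := "/subscriptions/sub/resourcegroups/mc_rg/providers/Microsoft.Compute/virtualMachineScaleSets/aks-pool-vmss/virtualMachines/0"
	name, ok := idx[normalizeProviderIDSketch(vmID)]
	fmt.Println(name, ok) // prints "aks-pool-vmss000000 true"
}
```

With such an index, a VM whose normalized ID has no entry corresponds to the `hasK8sNode == false` case returned by `evaluateZombieStatus`.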
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: