[ca] fix: undo Scale operation when a scaleUpRequest times out #9132
Welcome @thatmidwesterncoder!
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR modifies the cluster-autoscaler to automatically revert (decrease) the target size of a node group when a scale-up request times out.
Currently, when a scale-up fails due to infrastructure provider capacity limits (e.g., a CAPI infrastructure provider cannot provision new nodes), the autoscaler removes the scale-up request from tracking and puts the node group in backoff, but it does not decrease the target size back to its original value. As a result, the infrastructure provider retries provisioning the failed nodes indefinitely, even after the workload that triggered the scale-up has disappeared. Manual intervention is required to scale the cluster back down.
With this fix, when a scale-up request times out:
- The autoscaler calls `DecreaseTargetSize()` on the node group to revert the increase

This allows clusters running close to infrastructure capacity to automatically recover from failed scale-ups without manual intervention.
Which issue(s) this PR fixes:
Fixes #9120
Special notes for your reviewer:
- The `DecreaseTargetSize` method is called with a negative value equal to the original `Increase` from the scale-up request
- Tests use the `WithOnScaleUp` handler, since the test provider now needs to handle the decrease operation
- `TestExpiredScaleUpRevertsTargetSize` - verifies target size is decreased on timeout
- `TestExpiredScaleUpRevertsTargetSizeHandlesError` - verifies graceful handling when decrease fails
- `TestExpiredScaleUpRevertsPartialIncrease` - verifies correct decrease amount for partial increases

Does this PR introduce a user-facing change?