Extract the RegisterFailedScaleUp metric generation into a separate NodeGroupChangeObserver instance #9136
base: master
Conversation
Hi @shaikenov. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: shaikenov. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
This is a rebased version of #9083. Will leave replies to the relevant comments here for convenience:

We already had the

cc: @MartynaGrotek
type metricObserver interface {
	RegisterFailedScaleUp(reason metrics.FailedScaleUpReason, gpuResourceName, gpuType string)
}

type failedScaleUpInfo struct {
Have you considered reusing and extending the already existing struct ScaleUpFailure?
Yes. Initially I did not want to change the many tests that use csr.scaleUpFailures by adding new fields, but I see that it can be done -> modified the code to reuse ScaleUpFailure with new gpu fields added. I prefer not to change how RegisterFailedScaleUp populates the csr.scaleUpFailures map, since we already have all the fields we are interested in there, and populating the gpu fields would bring zero additional value, IMO.
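For reference, a minimal sketch of the two shapes being discussed. The failedScaleUpInfo fields are taken from the diff in this thread, and ScaleUpFailure's fields from its existing usage in clusterstate; this is only to illustrate the trade-off, not the final code:

```go
package clusterstate

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/autoscaler/cluster-autoscaler/metrics"
)

// New, unexported carrier proposed in this PR (fields as shown in the diff).
type failedScaleUpInfo struct {
	nodeGroup       cloudprovider.NodeGroup
	errorInfo       cloudprovider.InstanceErrorInfo
	gpuResourceName string
	gpuType         string
}

// Existing exported struct; reusing it would mean adding the gpu fields here
// and touching the tests that construct ScaleUpFailure values.
type ScaleUpFailure struct {
	NodeGroup cloudprovider.NodeGroup
	Reason    metrics.FailedScaleUpReason
	Time      time.Time
}
```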
	nodeGroup       cloudprovider.NodeGroup
	errorInfo       cloudprovider.InstanceErrorInfo
	gpuResourceName string
	gpuType         string
These fields seem to be used only for reporting metrics and can be evaluated based on nodeGroup.
We could also consider moving that logic out of here to NodeGroupChangeObserver and getting rid of gpuResourceName, gpuType string from the interface method RegisterFailedScaleUp.
Yes, it can be moved, but to extract gpuResourceName and gpuType from the nodeGroup we would need to pass the cloudProvider or GpuConfig objects to the RegisterFailedScaleUp interface anyway, because GetNodeGpuConfig() is a method of the cloud provider.
Why? The cloud provider could just be passed to the NodeGroupChangeObserver instance during creation.
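A minimal sketch of that suggestion, assuming the observer receives the cloud provider at construction time. The package name, constructor name, and the exact RegisterFailedScaleUp signature are illustrative, not the final API:

```go
package metricsobserver

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/autoscaler/cluster-autoscaler/metrics"
)

// failedScaleUpMetricsObserver emits the failed-scale-up metric. It keeps a
// reference to the cloud provider so GPU information could be derived from the
// node group itself instead of being passed through the observer interface.
type failedScaleUpMetricsObserver struct {
	cloudProvider cloudprovider.CloudProvider
}

// NewFailedScaleUpMetricsObserver (hypothetical name) wires the cloud provider
// in at creation time, as suggested in the review comment above.
func NewFailedScaleUpMetricsObserver(cp cloudprovider.CloudProvider) *failedScaleUpMetricsObserver {
	return &failedScaleUpMetricsObserver{cloudProvider: cp}
}

// RegisterFailedScaleUp reports the metric. The gpu labels could instead be
// computed here (e.g. from the node group's template node via the cloud
// provider's GetNodeGpuConfig), which would let the gpuResourceName/gpuType
// parameters be dropped from the interface; they are kept in this sketch to
// match the current signature.
func (o *failedScaleUpMetricsObserver) RegisterFailedScaleUp(nodeGroup cloudprovider.NodeGroup, reason string, errorMessage, gpuResourceName, gpuType string, currentTime time.Time) {
	metrics.RegisterFailedScaleUp(metrics.FailedScaleUpReason(reason), gpuResourceName, gpuType)
}
```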
		ErrorMessage: errorMessage,
	}, gpuResourceName, gpuType, currentTime)
	csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: metrics.FailedScaleUpReason(errorInfo.ErrorCode), Time: currentTime})
	csr.backoffNodeGroup(nodeGroup, errorInfo, currentTime)
Does backoff require locking? From what I see, we are not locking when calling BackoffStatusForNodeGroup().
I think we need locking for exponentialBackoff.Backoff() to properly back off all failing node groups, and it has been implemented like this since the beginning. I am not 100% sure that it is safe to remove it.
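A generic illustration of the concern, not the autoscaler's actual backoff implementation: when several code paths can record failures for node groups, the shared backoff state needs to be updated under a mutex, which is why the existing code keeps the lock held around backoffNodeGroup().

```go
package backoffsketch

import (
	"sync"
	"time"
)

// backoffRegistry is a stand-in for shared exponential-backoff state: without
// the mutex, concurrent failure reports for node groups could race on the map.
type backoffRegistry struct {
	mu           sync.Mutex
	backoffUntil map[string]time.Time // nodeGroupID -> time until which scale-ups are suppressed
}

func newBackoffRegistry() *backoffRegistry {
	return &backoffRegistry{backoffUntil: map[string]time.Time{}}
}

// recordFailure updates the shared map under the lock.
func (b *backoffRegistry) recordFailure(nodeGroupID string, now time.Time, duration time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.backoffUntil[nodeGroupID] = now.Add(duration)
}

// isBackedOff reads the shared map under the same lock.
func (b *backoffRegistry) isBackedOff(nodeGroupID string, now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return now.Before(b.backoffUntil[nodeGroupID])
}
```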
	interrupt chan struct{}
	nodeGroupConfigProcessor nodegroupconfig.NodeGroupConfigProcessor
	asyncNodeGroupStateChecker asyncnodegroups.AsyncNodeGroupStateChecker
	metrics metricObserver
I would leave a blank line here; scaleStateNotifier shouldn't require locking, so let's be clear about that: https://dmitri.shuralyov.com/idiomatic-go#mutex-hat
Left a blank line before scaleStateNotifier and put it after scaleUpFailures, because scaleUpFailures does require a lock everywhere.
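For illustration, a hedged sketch of the "mutex hat" layout being discussed. The struct name is made up, the field set is trimmed to the ones mentioned in this thread, and the scaleStateNotifier type and import path are as recalled from the codebase rather than confirmed here:

```go
package clusterstate

import (
	"sync"

	"k8s.io/autoscaler/cluster-autoscaler/observers/nodegroupchange"
)

// Fields that must only be touched with the mutex held sit directly under it;
// the blank line then separates fields that need no locking, such as
// scaleStateNotifier, per https://dmitri.shuralyov.com/idiomatic-go#mutex-hat.
type clusterStateSketch struct {
	sync.Mutex
	scaleUpFailures map[string][]ScaleUpFailure // guarded by the mutex above

	scaleStateNotifier nodegroupchange.NodeGroupChangeObserversList // no locking required
}
```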
		ErrorCode: string(reason),
		ErrorMessage: errorMessage,
	}, gpuResourceName, gpuType, currentTime)
	csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: metrics.FailedScaleUpReason(errorInfo.ErrorCode), Time: currentTime})
csr.scaleUpFailures is not very widely used, and it is cleaned up at the beginning of every loop. It would be good to rethink whether we really need it, or whether we could just move to the scaleStateNotifier callback chain. But that might be a bigger task on its own.
csr.scaleUpFailures is used only in clusterstate tests. I think it is good to keep it for test coverage for now. Anyway, I made the struct and all its fields private, since they are used only in this pkg.
Force-pushed from 769395d to 88725c0
Force-pushed from 88725c0 to 1d22dff
What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

In cluster-autoscaler/clusterstate/clusterstate.go, in some cases we call registerFailedScaleUpNoLock(), which is not informing all NodeGroupChangeObservers properly -> it needs to be done through NodeGroupChangeObserverList.RegisterFailedScaleUp().

We report the RegisterFailedScaleUp metric under the ClusterStateRegistry.lock(), which is also not needed.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
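To summarize the intended flow described above, a hedged wiring sketch. The constructor NewNodeGroupChangeObserversList, the Register method, the observers import path, and the hypothetical NewFailedScaleUpMetricsObserver from the sketch earlier in this thread are assumptions about the surrounding code, not confirmed signatures:

```go
package metricsobserver

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/autoscaler/cluster-autoscaler/observers/nodegroupchange"
)

// wireFailedScaleUpMetrics registers the metric-emitting observer on the
// observers list at startup (constructor and Register are assumed names).
func wireFailedScaleUpMetrics(cloudProvider cloudprovider.CloudProvider) *nodegroupchange.NodeGroupChangeObserversList {
	observers := nodegroupchange.NewNodeGroupChangeObserversList()
	observers.Register(NewFailedScaleUpMetricsObserver(cloudProvider))
	return observers
}

// Later, on a scale-up failure, every registered observer is notified through
// the list instead of registerFailedScaleUpNoLock(), so the metric is emitted
// without holding ClusterStateRegistry.lock(); roughly:
//
//	observers.RegisterFailedScaleUp(nodeGroup, string(errorInfo.ErrorCode),
//		errorInfo.ErrorMessage, gpuResourceName, gpuType, time.Now())
```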