
Conversation

@shaikenov
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

  • In cluster-autoscaler/clusterstate/clusterstate.go, some code paths call registerFailedScaleUpNoLock(), which does not properly inform all NodeGroupChangeObservers -> this should instead go through NodeGroupChangeObserverList.RegisterFailedScaleUp()

  • We report the RegisterFailedScaleUp metric while holding the ClusterStateRegistry lock, which is also not needed. A minimal sketch of the intended shape follows this list.
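For illustration, a minimal sketch of the fan-out that going through the observer list gives you; the types below are simplified stand-ins, not the real clusterstate / nodegroupchange signatures:

```go
package example

import "time"

// Simplified stand-in; the real event carries more context (node group,
// error info, gpu details).
type failedScaleUpEvent struct {
	nodeGroupID string
	errorCode   string
	time        time.Time
}

type nodeGroupChangeObserver interface {
	RegisterFailedScaleUp(event failedScaleUpEvent)
}

// nodeGroupChangeObserverList fans a single event out to every registered
// observer. Calling the list (instead of registerFailedScaleUpNoLock) is
// what guarantees that all observers are informed.
type nodeGroupChangeObserverList struct {
	observers []nodeGroupChangeObserver
}

func (l *nodeGroupChangeObserverList) RegisterFailedScaleUp(event failedScaleUpEvent) {
	for _, o := range l.observers {
		o.RegisterFailedScaleUp(event)
	}
}
```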

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/needs-area labels Jan 28, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 28, 2026
@k8s-ci-robot
Contributor

Hi @shaikenov. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shaikenov
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-area labels Jan 28, 2026
@shaikenov
Contributor Author

This is a rebased version of #9083.

I will leave replies to the relevant comments here for convenience:

  • Why use errorInfo cloudprovider.InstanceErrorInfo instead of reason string, errMsg string?

    -> After the refactoring, RegisterFailedScaleUp() in ClusterStateRegistry calls csr.backoffNodeGroup(nodeGroup, errorInfo, currentTime), which requires the full InstanceErrorInfo. handleInstanceCreationErrorsForNodeGroup can report errors with a different ErrorClass, so to avoid losing any information it is easier to just wrap everything in cloudprovider.InstanceErrorInfo and pass it to RegisterFailedScaleUp (a rough sketch of this wrapping follows this list).

  • Why not the interface NodeGroupChangeObserver?

    -> If I use NodeGroupChangeObserverList I can call Register() directly and not worry about the actual implementation underneath NodeGroupChangeObserver. IMO in this place we will always have a NodeGroupChangeObserverList, so there is no need to be more generic.

  • Have you checked whether the logic of these functions (besides registerFailedScaleUpNoLock) really requires locking?
    Some of these methods could even be changed to free functions, e.g. buildInstanceToErrorCodeMappings.

    -> In this PR I wanted to replicate the behavior already in place, only moving the metric reporting logic out. Maybe we can address this in follow-ups.
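For illustration, a rough sketch of the wrapping described in the first answer; the helper itself is hypothetical, only the InstanceErrorInfo field names come from the cloudprovider package:

```go
package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// wrapScaleUpError is a hypothetical helper: it bundles the error class,
// reason and message into a single InstanceErrorInfo so that nothing is
// lost before the value reaches RegisterFailedScaleUp.
func wrapScaleUpError(class cloudprovider.InstanceErrorClass, reason, errMsg string) cloudprovider.InstanceErrorInfo {
	return cloudprovider.InstanceErrorInfo{
		ErrorClass:   class,
		ErrorCode:    reason,
		ErrorMessage: errMsg,
	}
}
```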

We already had unit tests for the RegisterFailedScaleUp metric in place. In this PR I rearranged things a bit, and since we no longer mock/report the metric from ClusterStateRegistry but from NodeGroupChangeMetricsProducer, that part did require some clean-up.

cc: @MartynaGrotek


type metricObserver interface {
	RegisterFailedScaleUp(reason metrics.FailedScaleUpReason, gpuResourceName, gpuType string)
}

type failedScaleUpInfo struct {
Contributor

Have you considered reusing and extending the already existing struct ScaleUpFailure?

Contributor Author

Yes. Initially I did not want to change the many tests which use csr.scaleUpFailures by adding new fields, but I see that it can be done -> modified to use scaleUpFailure by adding the new gpu fields. I prefer not to change how RegisterFailedScaleUp populates the csr.scaleUpFailures map, since we already have all the fields we are interested in there, and populating the gpu fields would bring zero additional value, IMO.
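Roughly the shape this describes, as a hedged sketch; the field names below are assumptions, not the actual diff:

```go
package example

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// scaleUpFailure as sketched here is an assumption about the extended,
// now package-private struct: the original node group / time bookkeeping
// plus the error info and gpu fields needed for metric reporting.
type scaleUpFailure struct {
	nodeGroup       cloudprovider.NodeGroup
	errorInfo       cloudprovider.InstanceErrorInfo
	gpuResourceName string
	gpuType         string
	time            time.Time
}
```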

	nodeGroup       cloudprovider.NodeGroup
	errorInfo       cloudprovider.InstanceErrorInfo
	gpuResourceName string
	gpuType         string
}
Contributor

These fields seem to be used only for reporting metrics and could be evaluated based on nodeGroup.
We could also consider moving that logic out of here to NodeGroupChangeObserver and getting rid of gpuResourceName, gpuType string from the interface method RegisterFailedScaleUp.

Contributor Author

Yes, it can be moved, but to extract gpuResourceName, gpuType string from the nodeGroup we would need to pass the cloudProvider or GpuConfig objects to the RegisterFailedScaleUp interface anyway, because GetNodeGpuConfig() is a method of the cloud provider.

Contributor

Why? The cloud provider could just be passed to the NodeGroupChangeObserver instance during creation.
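A sketch of that suggestion; the helper and its lookup path (TemplateNodeInfo to obtain a node for GetNodeGpuConfig) are assumptions, not code from this PR:

```go
package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// resolveGpuInfo shows how an observer holding a cloud provider could derive
// the gpu details from the node group itself, so gpuResourceName and gpuType
// would no longer need to travel through RegisterFailedScaleUp's signature.
func resolveGpuInfo(cp cloudprovider.CloudProvider, ng cloudprovider.NodeGroup) (gpuResourceName, gpuType string) {
	nodeInfo, err := ng.TemplateNodeInfo()
	if err != nil || nodeInfo == nil {
		return "", ""
	}
	if gpuConfig := cp.GetNodeGpuConfig(nodeInfo.Node()); gpuConfig != nil {
		return string(gpuConfig.ResourceName), gpuConfig.Type
	}
	return "", ""
}
```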

ErrorMessage: errorMessage,
}, gpuResourceName, gpuType, currentTime)
csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: metrics.FailedScaleUpReason(errorInfo.ErrorCode), Time: currentTime})
csr.backoffNodeGroup(nodeGroup, errorInfo, currentTime)
Contributor

Does backoff require locking? From what I see we are not locking when calling BackoffStatusForNodeGroup().

Contributor Author

I think we need the locking for exponentialBackoff.Backoff() to properly back off all failing node groups, and it has been implemented like this since the beginning. I am not 100% sure that it is safe to remove it.
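For illustration, the locking split being discussed, with stand-in types rather than the real ClusterStateRegistry: shared state (failures, backoff) stays under the lock, the metric is reported after releasing it:

```go
package example

import (
	"sync"
	"time"
)

// Illustrative stand-ins only; the real registry and backoff look different.
type errInfo struct{ ErrorCode, ErrorMessage string }

type failure struct {
	info errInfo
	time time.Time
}

type metricReporter interface {
	RegisterFailedScaleUp(reason string)
}

type registry struct {
	sync.Mutex
	scaleUpFailures map[string][]failure
	backoff         func(nodeGroupID string, info errInfo, now time.Time) // shared backoff state, guarded by the mutex
	metrics         metricReporter
}

// RegisterFailedScaleUp keeps the shared-state mutations (failure bookkeeping
// and backoff) under the lock, but reports the metric only after releasing it.
func (r *registry) RegisterFailedScaleUp(nodeGroupID string, info errInfo, now time.Time) {
	r.Lock()
	r.scaleUpFailures[nodeGroupID] = append(r.scaleUpFailures[nodeGroupID], failure{info: info, time: now})
	r.backoff(nodeGroupID, info, now)
	r.Unlock()

	// Metric reporting does not read or write registry state, so it needs no lock.
	r.metrics.RegisterFailedScaleUp(info.ErrorCode)
}
```

This mirrors the point above: the backoff call stays guarded, while the metric reporting moves out of the critical section.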

interrupt chan struct{}
nodeGroupConfigProcessor nodegroupconfig.NodeGroupConfigProcessor
asyncNodeGroupStateChecker asyncnodegroups.AsyncNodeGroupStateChecker
metrics metricObserver
Contributor

I would leave a blank line here; scaleStateNotifier shouldn't require locking, so let's be clear about that: https://dmitri.shuralyov.com/idiomatic-go#mutex-hat

Contributor Author

I left a blank line before scaleStateNotifier and placed it right after scaleUpFailures, because the fields above it do require the lock everywhere.
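A small sketch of the mutex-hat layout the linked article describes, with illustrative field names rather than the actual ClusterStateRegistry:

```go
package example

import "sync"

// Fields between the sync.Mutex and the blank line are guarded by it;
// fields after the blank line need no locking.
type registryLayout struct {
	sync.Mutex
	scaleUpRequests map[string]int // guarded by the mutex
	scaleUpFailures map[string]int // guarded by the mutex

	scaleStateNotifier func() // lock-free, hence placed below the blank line
}
```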

ErrorCode: string(reason),
ErrorMessage: errorMessage,
}, gpuResourceName, gpuType, currentTime)
csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: metrics.FailedScaleUpReason(errorInfo.ErrorCode), Time: currentTime})
Contributor

csr.scaleUpFailures is not very widely used, and it is cleaned up at the beginning of every loop. It would be good to rethink whether we really need it, or whether we could just move to the scaleStateNotifier callback chain. But that might be a bigger task on its own.

Contributor Author

csr.scaleUpFailures is used only in the clusterstate tests. I think it is good to still keep it for test coverage for now. Anyway, I made the struct and all its fields private, since it is only used in this package.

@shaikenov shaikenov force-pushed the shaikenov-provisioning-req-impl branch from 769395d to 88725c0 on February 2, 2026 at 15:41
@shaikenov shaikenov force-pushed the shaikenov-provisioning-req-impl branch from 88725c0 to 1d22dff on February 3, 2026 at 13:45

Labels

  • area/cluster-autoscaler
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
