
Conversation

@shaikenov
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

  • In cluster-autoscaler/clusterstate/clusterstate.go, some code paths call registerFailedScaleUpNoLock(), which does not properly inform all NodeGroupChangeObservers -> this should instead go through NodeGroupChangeObserverList.RegisterFailedScaleUp()

  • We report the RegisterFailedScaleUp metric while holding the ClusterStateRegistry lock, which is also not needed. A minimal sketch of the intended shape follows this list.
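For illustration, a minimal sketch of the fan-out that going through the observer list gives you; the types below are simplified stand-ins, not the real clusterstate / nodegroupchange signatures:

```go
package example

import "time"

// Simplified stand-in; the real event carries more context (node group,
// error info, gpu details).
type failedScaleUpEvent struct {
	nodeGroupID string
	errorCode   string
	time        time.Time
}

type nodeGroupChangeObserver interface {
	RegisterFailedScaleUp(event failedScaleUpEvent)
}

// nodeGroupChangeObserverList fans a single event out to every registered
// observer. Calling the list (instead of registerFailedScaleUpNoLock) is
// what guarantees that all observers are informed.
type nodeGroupChangeObserverList struct {
	observers []nodeGroupChangeObserver
}

func (l *nodeGroupChangeObserverList) RegisterFailedScaleUp(event failedScaleUpEvent) {
	for _, o := range l.observers {
		o.RegisterFailedScaleUp(event)
	}
}
```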

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/needs-area labels Jan 28, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 28, 2026
@k8s-ci-robot
Contributor

Hi @shaikenov. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shaikenov
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-area labels Jan 28, 2026
@shaikenov
Contributor Author

This is a rebased version of #9083.

I will leave replies to the relevant comments here for convenience:

  • Why use errorInfo cloudprovider.InstanceErrorInfo instead of reason string, errMsg string?

    -> After the refactoring, RegisterFailedScaleUp() in ClusterStateRegistry calls csr.backoffNodeGroup(nodeGroup, errorInfo, currentTime), which requires the full InstanceErrorInfo. handleInstanceCreationErrorsForNodeGroup can report errors with a different ErrorClass, so to avoid losing any information it is easier to just wrap everything in cloudprovider.InstanceErrorInfo and pass it to RegisterFailedScaleUp (a rough sketch of this wrapping follows this list).

  • Why not the interface NodeGroupChangeObserver?

    -> If I use NodeGroupChangeObserverList I can call Register() directly and not worry about the actual implementation underneath NodeGroupChangeObserver. IMO in this place we will always have a NodeGroupChangeObserverList, so there is no need to be more generic.

  • Have you checked whether the logic of these functions (besides registerFailedScaleUpNoLock) really requires locking?
    Some of these methods could even be changed to free functions, e.g. buildInstanceToErrorCodeMappings.

    -> In this PR I wanted to replicate the behavior already in place, only moving the metric reporting logic out. Maybe we can address this in follow-ups.
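For illustration, a rough sketch of the wrapping described in the first answer; the helper itself is hypothetical, only the InstanceErrorInfo field names come from the cloudprovider package:

```go
package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// wrapScaleUpError is a hypothetical helper: it bundles the error class,
// reason and message into a single InstanceErrorInfo so that nothing is
// lost before the value reaches RegisterFailedScaleUp.
func wrapScaleUpError(class cloudprovider.InstanceErrorClass, reason, errMsg string) cloudprovider.InstanceErrorInfo {
	return cloudprovider.InstanceErrorInfo{
		ErrorClass:   class,
		ErrorCode:    reason,
		ErrorMessage: errMsg,
	}
}
```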

We already had unit tests for the RegisterFailedScaleUp metric in place. In this PR I rearranged things a bit, and since we no longer mock/report the metric from ClusterStateRegistry but from NodeGroupChangeMetricsProducer, that part did require some clean-up.

cc: @MartynaGrotek


type metricObserver interface {
	RegisterFailedScaleUp(reason metrics.FailedScaleUpReason, gpuResourceName, gpuType string)
}

type failedScaleUpInfo struct {
Contributor

Have you considered reusing and extending the already existing struct ScaleUpFailure?

Contributor Author

Yes. Initially I did not want to change the many tests which use csr.scaleUpFailures by adding new fields, but I see that it can be done -> modified to use scaleUpFailure by adding the new gpu fields. I prefer not to change how RegisterFailedScaleUp populates the csr.scaleUpFailures map, since we already have all the fields we are interested in there, and populating the gpu fields would bring zero additional value, IMO.
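Roughly the shape this describes, as a hedged sketch; the field names below are assumptions, not the actual diff:

```go
package example

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// scaleUpFailure as sketched here is an assumption about the extended,
// now package-private struct: the original node group / time bookkeeping
// plus the error info and gpu fields needed for metric reporting.
type scaleUpFailure struct {
	nodeGroup       cloudprovider.NodeGroup
	errorInfo       cloudprovider.InstanceErrorInfo
	gpuResourceName string
	gpuType         string
	time            time.Time
}
```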

	nodeGroup       cloudprovider.NodeGroup
	errorInfo       cloudprovider.InstanceErrorInfo
	gpuResourceName string
	gpuType         string
}
Contributor

These fields seem to be used only for reporting metrics and could be evaluated based on nodeGroup.
We could also consider moving that logic out of here to NodeGroupChangeObserver and getting rid of gpuResourceName, gpuType string from the interface method RegisterFailedScaleUp.

Contributor Author

Yes, it can be moved, but to extract gpuResourceName, gpuType string from the nodeGroup we would need to pass the cloudProvider or GpuConfig objects to the RegisterFailedScaleUp interface anyway, because GetNodeGpuConfig() is a method of the cloud provider.

Contributor

Why? The cloud provider could just be passed to the NodeGroupChangeObserver instance during creation.
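A sketch of that suggestion; the helper and its lookup path (TemplateNodeInfo to obtain a node for GetNodeGpuConfig) are assumptions, not code from this PR:

```go
package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// resolveGpuInfo shows how an observer holding a cloud provider could derive
// the gpu details from the node group itself, so gpuResourceName and gpuType
// would no longer need to travel through RegisterFailedScaleUp's signature.
func resolveGpuInfo(cp cloudprovider.CloudProvider, ng cloudprovider.NodeGroup) (gpuResourceName, gpuType string) {
	nodeInfo, err := ng.TemplateNodeInfo()
	if err != nil || nodeInfo == nil {
		return "", ""
	}
	if gpuConfig := cp.GetNodeGpuConfig(nodeInfo.Node()); gpuConfig != nil {
		return string(gpuConfig.ResourceName), gpuConfig.Type
	}
	return "", ""
}
```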

ErrorMessage: errorMessage,
}, gpuResourceName, gpuType, currentTime)
csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: metrics.FailedScaleUpReason(errorInfo.ErrorCode), Time: currentTime})
csr.backoffNodeGroup(nodeGroup, errorInfo, currentTime)
Contributor

Does backoff require locking? From what I see we are not locking when calling BackoffStatusForNodeGroup().

Contributor Author

I think we need the locking for exponentialBackoff.Backoff() to properly back off all failing node groups, and it has been implemented like this since the beginning. I am not 100% sure that it is safe to remove it.
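For illustration, the locking split being discussed, with stand-in types rather than the real ClusterStateRegistry: shared state (failures, backoff) stays under the lock, the metric is reported after releasing it:

```go
package example

import (
	"sync"
	"time"
)

// Illustrative stand-ins only; the real registry and backoff look different.
type errInfo struct{ ErrorCode, ErrorMessage string }

type failure struct {
	info errInfo
	time time.Time
}

type metricReporter interface {
	RegisterFailedScaleUp(reason string)
}

type registry struct {
	sync.Mutex
	scaleUpFailures map[string][]failure
	backoff         func(nodeGroupID string, info errInfo, now time.Time) // shared backoff state, guarded by the mutex
	metrics         metricReporter
}

// RegisterFailedScaleUp keeps the shared-state mutations (failure bookkeeping
// and backoff) under the lock, but reports the metric only after releasing it.
func (r *registry) RegisterFailedScaleUp(nodeGroupID string, info errInfo, now time.Time) {
	r.Lock()
	r.scaleUpFailures[nodeGroupID] = append(r.scaleUpFailures[nodeGroupID], failure{info: info, time: now})
	r.backoff(nodeGroupID, info, now)
	r.Unlock()

	// Metric reporting does not read or write registry state, so it needs no lock.
	r.metrics.RegisterFailedScaleUp(info.ErrorCode)
}
```

This mirrors the point above: the backoff call stays guarded, while the metric reporting moves out of the critical section.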

interrupt chan struct{}
nodeGroupConfigProcessor nodegroupconfig.NodeGroupConfigProcessor
asyncNodeGroupStateChecker asyncnodegroups.AsyncNodeGroupStateChecker
metrics metricObserver
Contributor

I would leave a blank line here; scaleStateNotifier shouldn't require locking, so let's be clear about that: https://dmitri.shuralyov.com/idiomatic-go#mutex-hat

Contributor Author

I left a blank line before scaleStateNotifier and placed it right after scaleUpFailures, because the fields above it do require the lock everywhere.
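A small sketch of the mutex-hat layout the linked article describes, with illustrative field names rather than the actual ClusterStateRegistry:

```go
package example

import "sync"

// Fields between the sync.Mutex and the blank line are guarded by it;
// fields after the blank line need no locking.
type registryLayout struct {
	sync.Mutex
	scaleUpRequests map[string]int // guarded by the mutex
	scaleUpFailures map[string]int // guarded by the mutex

	scaleStateNotifier func() // lock-free, hence placed below the blank line
}
```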

ErrorCode: string(reason),
ErrorMessage: errorMessage,
}, gpuResourceName, gpuType, currentTime)
csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: metrics.FailedScaleUpReason(errorInfo.ErrorCode), Time: currentTime})
Contributor

csr.scaleUpFailures is not very widely used, and it is cleaned up at the beginning of every loop. It would be good to rethink whether we really need it, or whether we could just move to the scaleStateNotifier callback chain. But that might be a bigger task on its own.

Contributor Author

csr.scaleUpFailures is used only in the clusterstate tests. I think it is good to still keep it for test coverage for now. Anyway, I made the struct and all its fields private, since it is only used in this package.

@shaikenov shaikenov force-pushed the shaikenov-provisioning-req-impl branch from 769395d to 88725c0 on February 2, 2026 at 15:41
@shaikenov shaikenov force-pushed the shaikenov-provisioning-req-impl branch from 88725c0 to 1d22dff on February 3, 2026 at 13:45

Labels

  • area/cluster-autoscaler
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
