HYPERFLEET-536: Optimize Sentinel with server-side condition-based filtering#60
HYPERFLEET-536: Optimize Sentinel with server-side condition-based filtering#60rafabene wants to merge 6 commits intoopenshift-hyperfleet:mainfrom
Conversation
…ltering Replace single fetch-all query with two targeted TSL queries to reduce network traffic and API load: 1. Not-ready resources: status.conditions.Ready='False' 2. Stale ready resources: Ready='True' with last_updated_time older than max_age_ready Results are merged and deduplicated by resource ID before evaluation. Includes graceful degradation if the stale query fails.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughFetchResources was extended to accept variadic additionalFilters and now builds a composite TSL search string via buildSearchString which merges label selectors and extra filters. fetchResourcesOnce signature was updated to receive additionalFilters and forwards the composed search to cluster/node-pool fetches. Sentinel was changed to perform two condition-based queries per poll (not-ready and stale-ready), merge results with first-slice precedence via mergeResources, and gracefully degrade if the stale query fails. Tests and integration harnesses were updated to cover composite search strings, two-query behavior, merging, and poll-cycle semantics. Sequence Diagram(s)sequenceDiagram
participant Sentinel as Sentinel
participant Client as HyperFleet Client
participant Server as HyperFleet Server
participant Merge as mergeResources
participant Eval as Decision Evaluator
participant Pub as Broker Publisher
Sentinel->>Client: FetchResources(labelSelector, additionalFilters)
Client->>Server: GET /resources?search=<labels AND Ready='False' [+ extra filters]>
Server-->>Client: Not-ready resources (A)
Client->>Server: GET /resources?search=<labels AND Ready='True' AND age filter [+ extra filters]>
Server-->>Client: Stale-ready resources (B) or error
Client->>Merge: mergeResources(A, B)
Merge->>Merge: deduplicate (first-slice precedence)
Merge-->>Sentinel: merged resources
Sentinel->>Eval: evaluate decisions
Eval->>Pub: publish events
Pub-->>Sentinel: publish results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
internal/sentinel/sentinel_test.go (1)
125-132: Make unexpected search queries fail fast in the test helper.The
defaultbranch currently returns merged clusters, which can hide query-construction regressions. Prefer returning400(or explicit test failure) for unknown search patterns.Suggested fix
switch { case strings.Contains(search, "Ready='False'"): clusters = notReadyClusters case strings.Contains(search, "Ready='True'"): clusters = staleClusters default: - clusters = append(notReadyClusters, staleClusters...) + w.WriteHeader(http.StatusBadRequest) + if err := json.NewEncoder(w).Encode(map[string]string{ + "error": fmt.Sprintf("unexpected search query: %s", search), + }); err != nil { + t.Logf("Error encoding bad request response: %v", err) + } + return }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel_test.go` around lines 125 - 132, The test helper's switch on the variable search currently falls through to merging clusters in the default branch, hiding bad query changes; replace the default branch so it fails fast by calling t.Fatalf("unexpected search query: %q", search) (or otherwise return an explicit 400/error) instead of appending notReadyClusters and staleClusters; update the switch in sentinel_test.go that references search, clusters, notReadyClusters and staleClusters so unknown search patterns trigger the test failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/sentinel/sentinel.go`:
- Around line 220-224: The code in trigger() treats any error from
s.client.FetchResources (staleResources) as a graceful success by returning
notReadyResources, nil; change this so context cancellation/timeouts are
propagated: after calling s.client.FetchResources(ctx, rt, labelSelector,
staleFilter) check if errors.Is(err, context.Canceled) || errors.Is(err,
context.DeadlineExceeded) and if so return nil, err (propagate the context
error) instead of returning notReadyResources, nil; for non-context errors keep
the existing graceful-degradation behavior. Ensure you import the errors package
and reference s.client.FetchResources, staleResources and trigger() when making
the change.
- Around line 233-244: mergeResources currently dedupes solely on Resource.ID so
entries with empty IDs collapse; change the dedupe key logic in mergeResources
to treat empty IDs as distinct by using a fallback unique key (e.g., prefix +
the index or source) when resource.ID == "" (for example "__empty_a_{i}" for
items from slice a and "__empty_b_{i}" for items from slice b) so each empty-ID
resource is preserved; update the seen map population and existence check to use
this computed key (and add fmt import if you use fmt.Sprintf).
In `@test/integration/integration_test.go`:
- Around line 122-133: The test mutates pollCycleCount and receivedSearchParams
from HTTP handler goroutines without synchronization, causing data races; fix by
introducing synchronization (e.g., a sync.Mutex or sync.RWMutex or atomic
counters) and use it wherever these shared variables are written or read: wrap
increments to pollCycleCount (referenced in the server handler closure) with
atomic.AddInt32 or lock/unlock around the mutation, and protect accesses and
appends to receivedSearchParams (the slice/map populated in the handler and
later asserted in the test) with the same mutex (lock in handler when mutating,
lock or RLock in the test when reading/asserting). Ensure all other handler/test
sites noted in the comment (the other handler blocks that touch these variables)
use the same synchronization primitives.
---
Nitpick comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 125-132: The test helper's switch on the variable search currently
falls through to merging clusters in the default branch, hiding bad query
changes; replace the default branch so it fails fast by calling
t.Fatalf("unexpected search query: %q", search) (or otherwise return an explicit
400/error) instead of appending notReadyClusters and staleClusters; update the
switch in sentinel_test.go that references search, clusters, notReadyClusters
and staleClusters so unknown search patterns trigger the test failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c9cf8e0d-0369-4d07-9924-e89505ac244c
📒 Files selected for processing (5)
internal/client/client.gointernal/client/client_test.gointernal/sentinel/sentinel.gointernal/sentinel/sentinel_test.gotest/integration/integration_test.go
- Propagate context cancellation/timeout errors in fetchFilteredResources instead of swallowing them via graceful degradation - Handle empty resource IDs in mergeResources so they are treated as distinct entries instead of collapsing - Add synchronization for shared variables in integration tests (atomic counters for pollCycleCount/callCount, mutex for receivedSearchParams) - Fail fast in mockServerForConditionQueries on unexpected search queries instead of silently merging results
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
test/integration/integration_test.go (1)
464-468: Fail fast on unexpected search queries in this mock handler.The default branch currently returns valid clusters (Line 464-468), which can mask regressions to non-condition query behavior. Prefer returning an error and failing the test on unexpected query shapes.
Suggested fail-fast default branch
default: - clusters = []map[string]interface{}{ - createMockCluster("cluster-old", 2, 2, true, now.Add(-35*time.Minute)), - createMockCluster("cluster-not-ready", 1, 1, false, now.Add(-15*time.Second)), - } + t.Errorf("unexpected search query: %q", search) + http.Error(w, "unexpected search query", http.StatusBadRequest) + return }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/integration_test.go` around lines 464 - 468, The default branch of the mock search handler is returning valid clusters which can hide unexpected query shapes; locate the switch/default in the mock handler that assigns to clusters (it currently uses createMockCluster("cluster-old", ...) and createMockCluster("cluster-not-ready", ...)), and change the default behavior to fail fast instead of returning data — e.g., call t.Fatalf (or otherwise return an error/HTTP 400) with a clear message including the unexpected query shape so the test fails immediately on unrecognized queries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 808-866: The test tracks queryCount but never asserts it, so add
an assertion after calling s.trigger(ctx) to ensure both backend queries were
attempted (i.e., queryCount == 2); place this check immediately after the
existing publish assertion and fail the test with a clear message if queryCount
is not 2 so the test will catch cases where the stale-ready query is skipped.
Reference the queryCount variable and the call to s.trigger(ctx) when adding the
assertion.
In `@test/integration/integration_test.go`:
- Around line 226-247: The test TestIntegration_LabelSelectorFiltering's mock
server currently ignores validating the actual outbound search query; modify the
httptest server handler (the server variable that reads searchParam) to record
each incoming searchParam into a slice or channel and then assert in the test
that the captured search values contain the expected condition fragments (e.g.,
"labels.shard='1'" plus the appropriate "Ready='True'" and "Ready='False'"
clauses) to ensure query composition is exercised; use the existing
createMockClusterWithLabels helper and searchParam variable to locate the
handler and add the capture/assert logic.
- Around line 390-400: The test currently only checks prefixes and can silently
pass if no queries were captured; update the validation around
receivedSearchParams: first assert that len(receivedSearchParams) > 0 (or use
paramsCopy) to ensure at least one query was observed, then assert that among
paramsCopy there exists at least one query containing "Ready='False'" and at
least one containing "Ready='True'"; keep the existing prefix check
(labelPrefix) but add these presence assertions to fail the test when queries
are missing or when either Ready condition is absent.
---
Nitpick comments:
In `@test/integration/integration_test.go`:
- Around line 464-468: The default branch of the mock search handler is
returning valid clusters which can hide unexpected query shapes; locate the
switch/default in the mock handler that assigns to clusters (it currently uses
createMockCluster("cluster-old", ...) and createMockCluster("cluster-not-ready",
...)), and change the default behavior to fail fast instead of returning data —
e.g., call t.Fatalf (or otherwise return an error/HTTP 400) with a clear message
including the unexpected query shape so the test fails immediately on
unrecognized queries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 716baa0d-106a-473c-b48c-071c9923c566
📒 Files selected for processing (3)
internal/sentinel/sentinel.gointernal/sentinel/sentinel_test.gotest/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/sentinel/sentinel.go
- Assert queryCount in TestTrigger_StaleQueryFailure to verify both not-ready and stale queries were attempted - Capture and assert search params in TestIntegration_LabelSelectorFiltering to validate label selector and condition fragments in outbound queries - Assert receivedSearchParams is non-empty and contains both Ready='False' and Ready='True' queries in TestIntegration_TSLSyntaxMultipleLabels - Fail fast on unexpected search queries in BrokerLoggerContext mock handler
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/integration/integration_test.go (1)
193-195: Prefer condition-based waiting over fixed sleep to reduce CI flakiness.Assertions at Line 193, Line 313, and Line 434 depend on earlier fixed
time.Sleep(300ms). On slower runs this can produce false negatives even when behavior is correct.Refactor pattern
- time.Sleep(300 * time.Millisecond) - cancel() + deadline := time.Now().Add(2 * time.Second) + for time.Now().Before(deadline) { + if pollCycleCount.Load() >= 2 { + break + } + time.Sleep(25 * time.Millisecond) + } + cancel()Apply the same “wait-until condition or timeout” pattern for captured-query assertions in the other two tests.
Also applies to: 313-315, 434-436
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/integration_test.go` around lines 193 - 195, Replace the fixed 300ms sleeps with a condition-based wait: implement a short helper or inline loop that polls the condition until a timeout (e.g., 5s) and fails the test on timeout; specifically, for the pollCycleCount check replace the direct assertion of pollCycleCount.Load() < 2 with a wait-until that polls pollCycleCount.Load() >= 2 and then asserts, and apply the same pattern to the captured-query assertions referenced at the other two locations so they wait-for the expected captured queries (instead of relying on time.Sleep).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 724-729: The test mutates receivedSearchParams and queryCount from
the HTTP handler goroutines without synchronization; protect
receivedSearchParams with a sync.Mutex (e.g., add var receivedMu sync.Mutex and
wrap append/read of receivedSearchParams in receivedMu.Lock()/Unlock()) and
convert queryCount to an atomic.Int32 (or use atomic.AddInt32/atomic.LoadInt32)
so increments in the handler use atomic.Add and reads in the test use
atomic.Load, updating references to receivedSearchParams and queryCount in the
handler and the test assertions accordingly (look for the httptest.NewServer
handler closure that appends to receivedSearchParams and the places the test
reads receivedSearchParams/queryCount).
---
Nitpick comments:
In `@test/integration/integration_test.go`:
- Around line 193-195: Replace the fixed 300ms sleeps with a condition-based
wait: implement a short helper or inline loop that polls the condition until a
timeout (e.g., 5s) and fails the test on timeout; specifically, for the
pollCycleCount check replace the direct assertion of pollCycleCount.Load() < 2
with a wait-until that polls pollCycleCount.Load() >= 2 and then asserts, and
apply the same pattern to the captured-query assertions referenced at the other
two locations so they wait-for the expected captured queries (instead of relying
on time.Sleep).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2194d8f0-2df4-446c-81ef-b2b1bc194592
📒 Files selected for processing (2)
internal/sentinel/sentinel_test.gotest/integration/integration_test.go
- Protect receivedSearchParams with sync.Mutex in TestTrigger_ConditionBasedQueries (append in handler, read in test) - Convert queryCount to atomic.Int32 in TestTrigger_StaleQueryFailure (increment in handler, read in test)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 823-839: The test handler currently treats any non-Ready='False'
search as a stale-query failure; change the logic to explicitly check for
Ready='True' (e.g., replace the catch-all else with else if
strings.Contains(search, "Ready='True'") to return the internal server error
stub) and make the final else return a distinct failure for unknown/malformed
queries (for example write http.StatusBadRequest with an explanatory body or
call t.Fatalf) so unknown search strings cause the test to fail; apply the same
explicit Ready='True' check to the other occurrence referenced around
createMockCluster / createMockClusterList so only the intended stale-query input
triggers the 500 response.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d08d6965-4b27-4af9-8ef9-d848b95eebc9
📒 Files selected for processing (1)
internal/sentinel/sentinel_test.go
…test Replace catch-all else with explicit Ready='True' check so only the intended stale query triggers the 500 response. Unknown search patterns now fail fast with 400 and t.Errorf.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/sentinel/sentinel_test.go (1)
729-752: Add a default case to fail on unexpected queries for consistency.Unlike
mockServerForConditionQueries(line 132-135) andTestTrigger_StaleQueryFailure(line 839-842), this handler silently returns an empty cluster list when neitherReady='False'norReady='True'is matched. This makes the test less robust — if the search string format changes unexpectedly, the test would pass with empty results rather than failing fast.♻️ Suggested improvement
} else if strings.Contains(search, "Ready='True'") { clusters = []map[string]interface{}{ createMockCluster("stale-1", 2, 2, true, now.Add(-31*time.Minute)), } + } else { + t.Errorf("unexpected search query: %q", search) + w.WriteHeader(http.StatusBadRequest) + return }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel_test.go` around lines 729 - 752, The test HTTP handler created in the anonymous function for the httptest server silently returns an empty cluster list when the query doesn't match "Ready='False'" or "Ready='True'"; update the handler inside the server := httptest.NewServer(http.HandlerFunc(...)) so it has a default branch that fails the test (e.g., t.Fatalf or t.Errorf) when an unexpected search string is received, referencing the received search parameter (receivedSearchParams or search) and the response-building helpers (createMockCluster/createMockClusterList) to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 729-752: The test HTTP handler created in the anonymous function
for the httptest server silently returns an empty cluster list when the query
doesn't match "Ready='False'" or "Ready='True'"; update the handler inside the
server := httptest.NewServer(http.HandlerFunc(...)) so it has a default branch
that fails the test (e.g., t.Fatalf or t.Errorf) when an unexpected search
string is received, referencing the received search parameter
(receivedSearchParams or search) and the response-building helpers
(createMockCluster/createMockClusterList) to locate the code to change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: db226001-df7a-47ab-857c-916c74b82ada
📒 Files selected for processing (1)
internal/sentinel/sentinel_test.go
When there is a generation change, the resource immediately gets Ready=False, so the sentinel will trigger because of this reason and will not take into account the most recent time it was Ready=True So, in theory there should be no delay, or I didn't understood the use case |
| // | ||
| // Results are merged and deduplicated by resource ID. The DecisionEngine still | ||
| // evaluates the filtered set in memory (e.g., for generation-based checks). | ||
| func (s *Sentinel) fetchFilteredResources(ctx context.Context, labelSelector map[string]string) ([]client.Resource, error) { |
There was a problem hiding this comment.
A possible optimization would be to fetch in parallel.
But we can leave that for later if necessary.
My concern would be the timeout for the requests, we have a timeout: 5s which means that both requests timing out will consume all the sentinel cycle.
But, checking the implementation, it seems that a cycle that takes longer will not mix with another sentinel execution, which will happen 5secs after the previous one has finished,
There was a problem hiding this comment.
Agreed, parallelizing the queries is a valid optimization. In practice, the
gains would only be significant in worst-case scenarios (e.g., both queries
timing out — 10s sequential vs 5s parallel). For typical fast responses, the
difference is negligible. I'll leave this as a future optimization if we
observe slow query times in production.
internal/sentinel/sentinel.go
Outdated
|
|
||
| // Query 1: Not-ready resources | ||
| notReadyResources, err := s.client.FetchResources(ctx, rt, labelSelector, | ||
| "status.conditions.Ready='False'") |
There was a problem hiding this comment.
Would it make sense to have these queries as constants to give them some name and be more easily identifiable?
There was a problem hiding this comment.
Good idea. Extracted both filters as package-level constants: notReadyFilter and staleReadyFilterFmt.
internal/sentinel/sentinel.go
Outdated
|
|
||
| // mergeResources combines two resource slices, deduplicating by resource ID. | ||
| // Resources from the first slice take precedence when duplicates are found. | ||
| // Resources with empty IDs are treated as distinct and always preserved. |
There was a problem hiding this comment.
How can we have a resource with empty ID?
That is basically what gets sent in the messages, shouldn't we discard these as errors?
There was a problem hiding this comment.
You're right, resources with empty IDs shouldn't be processed. Added an early
filter in the trigger loop that skips resources with empty ID and logs a
warning. This also allowed simplifying mergeResources — no more special
handling for empty IDs.
| // Fetch resources from HyperFleet API | ||
| resources, err := s.client.FetchResources(ctx, client.ResourceType(s.config.ResourceType), labelSelector) | ||
| // Fetch resources using condition-based server-side filtering: | ||
| // Query 1: Not-ready resources (need frequent reconciliation) |
There was a problem hiding this comment.
I wonder if there is a way to have a TSL query with OR conditions and have a single query?
But checking the API code, is not even possible now to search by status.conditions.Ready.last_updated_time
There was a problem hiding this comment.
Right, TSL doesn't support OR conditions and last_updated_time filtering isn't
available in the API yet. If either becomes available in the future, we can
consolidate into a single query.
- Replace V(2) logging with Debug/Debugf for simplicity - Extract filter query strings into named constants - Skip resources with empty ID and log a warning - Simplify mergeResources by removing empty ID special handling
Summary
status.conditions.Ready='False'), Query 2 fetches stale ready resources (Ready='True'withlast_updated_timeolder thanmax_age_ready)Changes
internal/client/client.goadditionalFilters ...stringtoFetchResources, newbuildSearchStringfunctioninternal/sentinel/sentinel.gofetchFilteredResourcesmethod (two queries),mergeResourcesdedup helperinternal/client/client_test.goTestBuildSearchString,TestFetchResources_WithAdditionalFilters,TestFetchResources_WithConditionFilterOnlyinternal/sentinel/sentinel_test.gomockServerForConditionQuerieshelper, 3 new tests + 7 existing tests updated for two-query patterntest/integration/integration_test.goTrade-off
Resources that recently became Ready may have generation-based reconciliation delayed up to
max_age_ready(30min), since they are only fetched when stale. This is acceptable becauseReady=Truealready indicates the adapter has processed the resource.Test plan
make test)make lint)make test-integration) — requires Docker/Podman environmentopenshift-hyperfleet/architectureJIRA: HYPERFLEET-536
Summary by CodeRabbit
New Features
Tests