HYPERFLEET-585 - doc: create reliability documentation (runbook and metrics) #68
xueli181114 wants to merge 1 commit into openshift-hyperfleet:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: (no approvers yet). The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the corresponding setting. Use the following commands to manage reviews:
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
Walkthrough

Adds three documentation files for the hyperfleet-adapter: docs/metrics.md (detailed metrics, PromQL queries, and recommended alerting rules; metrics are served on port 9090 and health endpoints on port 8080 with /healthz and /readyz), docs/runbook.md (an operational runbook covering failure modes, troubleshooting steps, recovery, verification commands, and escalation paths), and an update to docs/observability.md to point to metrics.md for alerting and monitoring guidance.

Sequence Diagram(s): (omitted)

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers:
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Around line 239-243: The "Errors by type" example uses
rate(hyperfleet_adapter_errors_total[5m]) which returns per-series rates but
does not group by the error_type label; update the example query to aggregate
rates grouped by error_type (e.g., wrap the rate(...) in a sum by
(error_type)(...) aggregation) so the output is explicitly grouped by the
error_type label as the heading describes.
- Around line 225-228: The success-rate query is wrongly keeping
status="success" on both numerator and denominator; change the denominator to
aggregate total traffic across statuses by removing the status selector and
summing the rate metric, e.g. use
rate(hyperfleet_adapter_events_processed_total{status="success"}[5m]) /
sum(rate(hyperfleet_adapter_events_processed_total[5m])) * 100 so the numerator
uses hyperfleet_adapter_events_processed_total{status="success"} and the
denominator uses sum(rate(hyperfleet_adapter_events_processed_total[5m])) to
compute percentage of total events.
- Around line 158-162: The alert misfires because
rate(hyperfleet_adapter_events_processed_total[15m]) is evaluated per status
label; aggregate out status first by applying sum by(component, version) to the
rate expression so the comparison to zero is done across all statuses, and
ensure the resulting vector uses the same labels as the join
(hyperfleet_adapter_up) — i.e., replace the raw rate(...) == 0 with sum
by(component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) ==
0 so the alert only fires when total events across statuses are zero.
In `@docs/runbook.md`:
- Line 276: The table cell containing the phrase "May be config or resource
issue." is missing a subject; update that cell to a complete sentence such as
"This may be a config or resource issue." by replacing the existing text
fragment exactly where the row reads "| **P3 - Medium** | Single replica
crash-looping; other replicas healthy | Investigate logs. May be config or
resource issue. |".
- Around line 201-204: The runbook line uses curl which doesn't exercise gRPC;
replace that invocation with grpcurl and call the standard health service to
validate the Maestro gRPC endpoint (use grpcurl -plaintext
maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check),
ensuring you reference grpcurl and the grpc.health.v1.Health/Check method
against maestro-grpc.maestro.svc.cluster.local:8090.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d58129d4-09f7-40fe-bfbb-7d6a35c3102b
📒 Files selected for processing (3)
- docs/metrics.md
- docs/observability.md
- docs/runbook.md
Force-pushed 4ea607b to 69dc64d.
Actionable comments posted: 1
♻️ Duplicate comments (3)
docs/metrics.md (3)
145-145: ⚠️ Potential issue | 🟡 Minor — "Errors by type" query should group by error_type.

The heading indicates grouping by error type, but the query returns per-series rates without explicit grouping. Add `sum by (error_type)` aggregation to match the heading.

♻️ Proposed fix

```diff
-rate(hyperfleet_adapter_errors_total[5m])
+sum by (error_type) (rate(hyperfleet_adapter_errors_total[5m]))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` at line 145, the "Errors by type" query uses rate(hyperfleet_adapter_errors_total[5m]) but doesn't aggregate by error_type; wrap the rate in a sum by (error_type) aggregation (i.e., use sum by (error_type)(rate(hyperfleet_adapter_errors_total[5m]))) so results are grouped by the error_type label.
128-131: ⚠️ Potential issue | 🟠 Major — success-rate query divides success by success, not success by total.

The denominator will match on status="success" from the numerator, so both sides evaluate to the same metric, yielding success/success instead of success/total. Aggregate the denominator across status labels.

🔧 Proposed fix

```diff
-rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
-/
-rate(hyperfleet_adapter_events_processed_total[5m])
-* 100
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
+)
+/
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total[5m])
+)
+* 100
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` around lines 128-131, the query incorrectly divides rate(hyperfleet_adapter_events_processed_total{status="success"}[5m]) by a denominator that still includes {status="success"}, yielding success/success; change the denominator to aggregate across status labels (e.g., wrap rate(hyperfleet_adapter_events_processed_total[5m]) in a sum or appropriate aggregation) so the numerator is divided by the total rate across all statuses.
61-64: ⚠️ Potential issue | 🟠 Major — no-events alert misfires due to status label fan-out.

The expression rate(hyperfleet_adapter_events_processed_total[15m]) == 0 is evaluated per status label, so it can alert even when events are flowing (e.g., when status="success" has events but status="failed" is zero). Aggregate away the status label before the comparison to zero.

🔧 Proposed fix

```diff
 expr: |
-  rate(hyperfleet_adapter_events_processed_total[15m]) == 0
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total[15m])
+  ) == 0
   and on(component, version) hyperfleet_adapter_up == 1
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` around lines 61-64, the alert fires per `status` because `rate(hyperfleet_adapter_events_processed_total[15m])` is not aggregated; wrap the rate in sum by (component, version) (...) before comparing to zero so the `status` label cannot cause fan-out, and keep the existing on(component, version) and hyperfleet_adapter_up == 1 logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Around line 44-48: The alert expression mistakenly divides
rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]) by
rate(hyperfleet_adapter_events_processed_total[5m]) which performs
vector-matching on the status label and yields 1; change the denominator to
aggregate across status labels (e.g., use
sum(rate(hyperfleet_adapter_events_processed_total[5m])) ) so the expression
becomes failed-rate / total-rate and correctly computes failed/total.
♻️ Duplicate comments (6)
docs/runbook.md (2)
201-204: ⚠️ Potential issue | 🟠 Major — use `grpcurl` for Maestro gRPC health validation instead of `curl`.

`curl` does not exercise gRPC health RPCs, so this check can be misleading for protocol-level validation.

Proposed fix

2. Test gRPC connectivity from the adapter pod:

```diff
-kubectl exec <pod> -- curl -s maestro-grpc.maestro.svc.cluster.local:8090
+kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/runbook.md` around lines 201-204, replace the curl-based connectivity test with a gRPC health check using grpcurl: invoke grpcurl with -plaintext and call the grpc.health.v1.Health/Check RPC (e.g., kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check) so the check exercises the gRPC health service rather than an HTTP probe.
276-276: ⚠️ Potential issue | 🟡 Minor — sentence fragment in escalation action cell.

Use a complete sentence for clarity in on-call guidance.

Proposed fix

```diff
-| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. May be config or resource issue. |
+| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/runbook.md` at line 276, the escalation action table row for "P3 - Medium" currently contains a sentence fragment ("Investigate logs. May be config or resource issue."); update that cell to a complete sentence in the same table row so the on-call runbook reads as a full sentence.

docs/metrics.md (4)
142-146: ⚠️ Potential issue | 🟡 Minor — "Errors by type" example should explicitly group by error_type.

The current query returns per-series rates but does not reflect the heading intent.

Proposed fix

```diff
-rate(hyperfleet_adapter_errors_total[5m])
+sum by (error_type) (rate(hyperfleet_adapter_errors_total[5m]))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` around lines 142-146, the "Errors by type" example currently uses rate(hyperfleet_adapter_errors_total[5m]), which returns per-series rates instead of aggregating by error type; update the PromQL to sum by (error_type)(rate(hyperfleet_adapter_errors_total[5m])) so the query groups error rates per error_type, and ensure the metric hyperfleet_adapter_errors_total contains the error_type label referenced.
44-48: ⚠️ Potential issue | 🔴 Critical — failure-rate alert currently computes failed/failed, not failed/total.

The denominator is matched on status="failed" by default vector matching, so this can evaluate incorrectly as 1. Aggregate across status first.

Proposed fix

```diff
 expr: |
-  rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m])
-  /
-  rate(hyperfleet_adapter_events_processed_total[5m])
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m])
+  )
+  /
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total[5m])
+  )
   > 0.1
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` around lines 44-48, the alert's denominator currently matches only status="failed" due to vector matching, producing failed/failed; change the denominator to aggregate across status first by summing the rate of hyperfleet_adapter_events_processed_total over all statuses (e.g., sum(rate(hyperfleet_adapter_events_processed_total[5m]))) so the expression becomes failed-rate / total-rate and correctly computes the failure rate.
62-64: ⚠️ Potential issue | 🟠 Major — no-events alert can misfire due to status label fan-out.

rate(hyperfleet_adapter_events_processed_total[15m]) == 0 is evaluated per status series. Sum across statuses before comparing to zero.

Proposed fix

```diff
 expr: |
-  rate(hyperfleet_adapter_events_processed_total[15m]) == 0
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total[15m])
+  ) == 0
   and on(component, version) hyperfleet_adapter_up == 1
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` around lines 62-64, the alert currently compares rate(hyperfleet_adapter_events_processed_total[15m]) == 0 per status, causing fan-out; wrap the rate in sum by (component, version) (...) so the zero check is performed on the aggregated series rather than per-status series, and keep the uptime check using hyperfleet_adapter_up == 1 joined on component and version.
128-131: ⚠️ Potential issue | 🟠 Major — success-rate query divides success by success instead of success by total.

Default matching keeps status="success" on both sides. Aggregate the denominator over all statuses.

Proposed fix

```diff
-rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
-/
-rate(hyperfleet_adapter_events_processed_total[5m])
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
+)
+/
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total[5m])
+)
 * 100
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` around lines 128-131, the success-rate query currently divides rate(hyperfleet_adapter_events_processed_total{status="success"}[5m]) by rate(hyperfleet_adapter_events_processed_total[5m]), which keeps status="success" on both sides and yields 100%; aggregate the denominator across all statuses (e.g., sum without (status)(rate(hyperfleet_adapter_events_processed_total[5m]))) so the numerator is divided by the total across all statuses, then multiply by 100.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a18461ce-0e3c-4f23-8a95-6f91c15546ba
📒 Files selected for processing (3)
- docs/metrics.md
- docs/observability.md
- docs/runbook.md
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/observability.md
Force-pushed 69dc64d to e57c7b8.
♻️ Duplicate comments (2)
docs/runbook.md (2)
276-276: ⚠️ Potential issue | 🟡 Minor — make the P3 action sentence complete.

Line 276 uses a fragment ("May be config or resource issue."). Please make it a full sentence.

Proposed doc fix

```diff
-| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. May be config or resource issue. |
+| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/runbook.md` at line 276, rewrite the fragment "May be config or resource issue." in the P3 - Medium row as a complete sentence, e.g. "It may be a configuration or resource issue.", to ensure grammatical completeness.
201-204: ⚠️ Potential issue | 🟠 Major — use `grpcurl` (not `curl`) to validate Maestro gRPC health.

Line 203 currently tests only TCP/HTTP reachability and does not validate a gRPC health RPC.

Proposed doc fix

2. Test gRPC connectivity from the adapter pod:

```diff
-kubectl exec <pod> -- curl -s maestro-grpc.maestro.svc.cluster.local:8090
+kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/runbook.md` around lines 201-204, replace the TCP/HTTP check "kubectl exec <pod> -- curl -s maestro-grpc.maestro.svc.cluster.local:8090" with a proper gRPC health check using grpcurl: "kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check" so the documentation validates the gRPC health RPC rather than only TCP/HTTP reachability.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a1e79fd9-f3da-4553-8717-a440abab491e
📒 Files selected for processing (3)
- docs/metrics.md
- docs/observability.md
- docs/runbook.md
✅ Files skipped from review due to trivial changes (1)
- docs/observability.md
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/metrics.md
Force-pushed e57c7b8 to 538c55a.
♻️ Duplicate comments (2)
docs/runbook.md (2)
201-204: ⚠️ Potential issue | 🟠 Major — use `grpcurl` for gRPC endpoint validation, not `curl`.

The troubleshooting command uses curl to test the Maestro gRPC endpoint, but curl cannot natively invoke gRPC methods, so it will fail to validate that the endpoint properly speaks the gRPC protocol. Replace it with grpcurl targeting the standard gRPC Health Checking Protocol to properly verify the Maestro gRPC endpoint is functional.

🔧 Proposed fix

2. Test gRPC connectivity from the adapter pod:

```diff
-kubectl exec <pod> -- curl -s maestro-grpc.maestro.svc.cluster.local:8090
+kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/runbook.md` around lines 201-204, the runbook uses curl to test the Maestro gRPC endpoint, which cannot validate gRPC; replace that line so the pod exec uses grpcurl and calls the standard gRPC Health Checking service (grpc.health.v1.Health/Check) against maestro-grpc.maestro.svc.cluster.local:8090, so the command actually verifies the gRPC protocol and service health.
276-276: ⚠️ Potential issue | 🟡 Minor — fix incomplete sentence.

The phrase "May be config or resource issue." is missing a subject. Complete the sentence for grammatical correctness.

✏️ Proposed fix

```diff
-| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. May be config or resource issue. |
+| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/runbook.md` at line 276, the "P3 - Medium" row contains the incomplete sentence "May be config or resource issue."; update that cell to a complete sentence such as "This may be a configuration or resource issue." while preserving the existing meaning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 50e0d1c1-0e2b-4bf5-97c6-a945840d8497
📒 Files selected for processing (3)
- docs/metrics.md
- docs/observability.md
- docs/runbook.md
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/metrics.md
- docs/observability.md
@coderabbitai what is your latest review result?
Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain: 🏁 Script executed (length of output: 81687)
docs/metrics.md
| Severity | Location | Issue |
|---|---|---|
| 🔴 Critical | Line 44–48 | HyperFleetAdapterHighFailureRate alert — denominator matches status="failed" via default vector matching, so the expression computes failed/failed (≈1) instead of failed/total. Fix: wrap both numerator and denominator in sum by (component, version)(...). |
| 🟠 Major | Line 61–64 | HyperFleetAdapterNoEventsProcessed alert — rate(...events_processed_total[15m]) == 0 is evaluated per status series, causing false alerts. Fix: aggregate with sum by (component, version)(rate(...)) before comparing to zero. |
| 🟠 Major | Line 128–131 | Success-rate PromQL query — divides success/success instead of success/total because default vector matching keeps status="success" on both sides. Fix: wrap both sides in sum by (component, version)(...). |
| 🟡 Minor | Line 142–146 | "Errors by type" example query — missing sum by (error_type) aggregation, so results are not grouped by error type as the heading implies. |
docs/runbook.md
| Severity | Location | Issue |
|---|---|---|
| 🟠 Major | Line 201–204 | curl is used to test the Maestro gRPC endpoint, but curl cannot invoke gRPC methods. Replace with: kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check |
| 🟡 Minor | Line 276 | Sentence fragment in the P3 escalation table: "May be config or resource issue." — add a subject, e.g. "This may be a config or resource issue." |
The most recent review run (2026-03-06T02:39:47Z, commit 538c55a0f) only re-checked docs/runbook.md (metrics.md was skipped as similar to previous changes). The PR is currently not approved. Would you like me to trigger a fresh full review once these issues are addressed?
…etrics)

Add operational documentation for on-call engineers and observability:
- docs/metrics.md: recommended alerting rules, thresholds, and PromQL queries (metric definitions live in observability.md)
- docs/runbook.md: failure modes, troubleshooting steps, recovery procedures, and escalation paths
- docs/observability.md: add cross-reference link to metrics.md
- Aggregate events rate by (component, version) in dead man's switch alert
- Group errors by error_type in example query
- Replace curl with grpcurl for Maestro gRPC health check
- Minor grammar fix in escalation table
Force-pushed 73cc8df to 8ea83d8.
@coderabbitai review 8ea83d8 again
✅ Actions performed: Full review triggered.
Is your review finished? It has already taken 10 minutes.
@coderabbitai what is your new review result? Please post it.
Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain: 🏁 Script executed (length of output: 51937)
✅ docs/metrics.md — All Issues Addressed
| Severity | Location | Issue | Status |
|---|---|---|---|
| 🔴 Critical | Line 44–48 | HyperFleetAdapterHighFailureRate — failed/failed vector matching | ✅ Fixed in 538c55a |
| 🟠 Major | Line 62–64 | HyperFleetAdapterNoEventsProcessed — per-status fan-out | ✅ Fixed in 73cc8df |
| 🟠 Major | Lines 225–228 | Success-rate PromQL — dividing success/success instead of success/total | ✅ Fixed in 538c55a |
| 🟡 Minor | Line 146 | "Errors by type" — missing sum by (error_type) aggregation | ✅ Fixed in 73cc8df |
✅ docs/runbook.md — All Issues Addressed
| Severity | Location | Issue | Status |
|---|---|---|---|
| 🟠 Major | Line 201–204 | curl used to test gRPC Maestro endpoint — replaced with grpcurl | ✅ Fixed in 73cc8df |
| 🟡 Minor | Line 276 | Sentence fragment in P3 escalation table | ✅ Fixed in 73cc8df |
The two most recent review runs (2026-03-06T02:28:52Z and 2026-03-06T02:39:47Z) produced no new actionable comments — only duplicates of previously resolved items. The PR looks clean. Would you like me to trigger a final review or approve the PR?
ciaranRoche left a comment
Nice work on this — having proper observability docs and a runbook is a big step for operational readiness. I went through all three files and cross-referenced against the codebase. Everything is very accurate (every log pattern, metric name, port, and endpoint checks out). Left a few inline suggestions below, nothing major.
> @@ -89,3 +89,7 @@ The adapter automatically registers Prometheus metrics from the [hyperfleet-brok
> | `hyperfleet_broker_message_duration_seconds` | Histogram | Message processing duration |
The adapter metrics tables above have a Labels column but the broker table doesn't — might be worth adding one for consistency. In particular, messages_consumed_total and message_duration_seconds both have a topic label (you can see it from RecordConsumed(topic) and RecordDuration(topic, ...) in the broker library), but that's not mentioned here. Only errors_total calls out its labels inline.
Not a big deal, but it'd help someone writing PromQL queries to know about topic upfront.
There should be a 1-1 relationship between:

- adapter (component name)
- subscription_id used by adapter
- topic associated with subscription_id

So the info on the topic may be redundant, but having it in the metrics may expedite problem research. It could also be useful as a safety check: if metrics somehow appear for an adapter on an unexpected topic, that may reveal a configuration error.
```yaml
alert: HyperFleetAdapterNoEventsProcessed
expr: |
  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
```
This alert has a subtle edge case — if the adapter has never processed any events (fresh deploy, or topic is empty), the counter doesn't exist yet, so rate() returns nothing, sum() returns nothing, and == 0 has nothing to evaluate against. The alert silently won't fire, which is exactly the scenario you'd want it to catch.
You could add an `absent()` fallback to handle this:

```yaml
expr: |
  (
    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
    or
    absent(hyperfleet_adapter_events_processed_total) == 1
  )
  and on(component, version)
  hyperfleet_adapter_up == 1
```

Or just add a note that this only fires once the first event has been processed — either way is fine.
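To make the fresh-deploy edge case concrete, here is a minimal sketch (the scrape content and file path are invented for illustration) showing that on a brand-new adapter the events counter simply isn't in the exposition yet, so a `rate(...) == 0` expression has nothing to evaluate:

```shell
# Simulated /metrics exposition from a freshly deployed adapter (hypothetical content):
cat > /tmp/metrics_fresh.txt <<'EOF'
hyperfleet_adapter_up{component="hyperfleet-adapter"} 1
EOF

# The events counter has never been incremented, so the series is absent from the
# scrape; a zero-rate alert over a missing series returns an empty result and never fires.
if ! grep -q '^hyperfleet_adapter_events_processed_total' /tmp/metrics_fresh.txt; then
  echo "counter absent: zero-rate alert cannot fire"
fi
```

This is exactly the gap the `absent()` clause closes.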
```yaml
expr: |
  histogram_quantile(0.95,
    rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
  ) > 60
```
Worth thinking about what happens when there are multiple adapter replicas. Right now histogram_quantile operates on the raw per-instance buckets. If two replicas scrape into the same Prometheus, you'd want to aggregate the buckets first:

```promql
histogram_quantile(0.95,
  sum by (component, version, le) (
    rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
  )
) > 60
```

Not urgent if we're single-instance today, but worth a note or a TODO so it doesn't bite us later.
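As a toy illustration of why buckets must be summed per `le` boundary before taking a quantile, this sketch (instance names and bucket counts are made up) merges per-instance cumulative bucket counts the way `sum by (..., le)` would:

```shell
# Per-instance cumulative histogram bucket counts (invented sample data):
cat > /tmp/buckets.txt <<'EOF'
instance=a le=0.5 12
instance=a le=1 20
instance=b le=0.5 8
instance=b le=1 15
EOF

# Aggregate counts across instances per bucket boundary, as sum by (le) would:
awk '{sum[$2] += $3} END {for (le in sum) print le, sum[le]}' /tmp/buckets.txt | sort
# prints:
# le=0.5 20
# le=1 35
```

Only after this merge does a quantile over the combined distribution make sense; computing a quantile per instance and then averaging the results would give a different, wrong answer.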
> Events are ACKed on failure and not automatically retried. To reprocess:
> 1. Identify the failed event from logs (look for `event_id`)
> 2. Republish the event to the broker topic
This step is a bit hand-wavy for someone debugging at 3am — they probably need a concrete command rather than "republish the event to the broker topic". Maybe add examples per broker type?

For Pub/Sub:

```shell
gcloud pubsub topics publish <topic> --message '<event-json>'
```

For RabbitMQ:

```shell
rabbitmqadmin publish exchange=<exchange> routing_key=<key> payload='<event-json>'
```

Even a placeholder would help point people in the right direction.
We can point to the async API contract, which contains the templates for messages:
https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/broker/asyncapi.yaml
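Building on that, a hedged sketch of a replay helper for the Pub/Sub case; the payload fields, topic name, and dry-run approach below are placeholders, and the real message shape comes from the asyncapi.yaml contract above:

```shell
# Hypothetical event payload; actual field names come from the asyncapi contract.
EVENT_JSON='{"event_id":"<event-id>","type":"<event-type>"}'
TOPIC='<broker-topic>'

# Print the publish command instead of executing it, so an on-call engineer
# can review the exact payload before replaying it:
echo gcloud pubsub topics publish "$TOPIC" --message "$EVENT_JSON"
```

Dropping the `echo` runs the real publish once the placeholders are filled in.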
```shell
kubectl exec <pod> -- curl -s localhost:8080/readyz | jq .

# Check metrics
kubectl exec <pod> -- curl -s localhost:9090/metrics | grep hyperfleet_adapter_up
```
The recovery verification checks that the pod is up and ready, but doesn't confirm events are actually flowing again. Might be worth adding a quick metrics check:

```shell
# Verify events are being processed successfully
kubectl exec <pod> -- curl -s localhost:9090/metrics | grep 'hyperfleet_adapter_events_processed_total'
```

That way you know the adapter isn't just alive but actually doing work.
As "doing work" may depend on messages being present on the queues, we could also ask to confirm there are messages in the specific broker (e.g. Pub/Sub).
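One way to make that check concrete is to compare two counter samples taken a short interval apart; the scrape files below are simulated, but the same extraction works on a real `kubectl exec ... /metrics` dump:

```shell
# Two simulated scrapes of the processed-events counter, ~30s apart (values invented):
cat > /tmp/scrape1.txt <<'EOF'
hyperfleet_adapter_events_processed_total{status="success"} 100
EOF
cat > /tmp/scrape2.txt <<'EOF'
hyperfleet_adapter_events_processed_total{status="success"} 112
EOF

# A positive delta means events are actually being processed,
# not just that the pod reports ready:
before=$(awk '/events_processed_total/ {print $2}' /tmp/scrape1.txt)
after=$(awk '/events_processed_total/ {print $2}' /tmp/scrape2.txt)
echo "delta=$((after - before))"
# prints: delta=12
```

A zero delta with messages sitting in the broker queue is the signal to dig further.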
> | `"context cancelled while waiting for resource deletion"` | Recreate timed out waiting for deletion | Resource may have finalizers preventing deletion |
>
> Retryable K8s errors (automatically retried): timeouts, server unavailable, internal errors, rate limiting.
> Non-retryable: forbidden, unauthorized, bad request, invalid, gone.
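For illustration, the retryable/non-retryable split above maps roughly onto HTTP status codes. This sketch is mine, not taken from pkg/errors/k8s_error.go, and the code-to-category mapping is an assumption; it just shows the shape of such a classifier:

```shell
# Rough status-code classification mirroring the doc's lists (assumed mapping):
#   retryable:     timeout (408), rate limiting (429), internal (500),
#                  unavailable (503), gateway timeout (504)
#   non-retryable: bad request (400), unauthorized (401), forbidden (403),
#                  gone (410), invalid (422)
classify() {
  case "$1" in
    408|429|500|503|504) echo retryable ;;
    400|401|403|410|422) echo non-retryable ;;
    *) echo unknown ;;
  esac
}

classify 503   # prints: retryable
classify 403   # prints: non-retryable
```

The real adapter additionally treats network-level failures (connection refused, reset) as retryable, which no status code captures, so a production classifier needs an error-type check as well.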
Small gap here — the code in pkg/errors/k8s_error.go also treats network errors (connection refused, reset, etc.) as retryable, and "method not supported" / "not acceptable" as non-retryable. Might be worth adding those for completeness since they can show up in practice.
> |----------|-----------|--------|
> | **P1 - Critical** | Adapter down across all replicas; no events processing | Page on-call SRE. Check broker and API dependencies. |
> | **P2 - High** | High failure rate (>10% events failing for 15+ min) | Notify team. Investigate error_type in metrics. |
> | **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |
Tiny grammar nit: "May be config or resource issue" reads a bit choppy — maybe "This may be a config or resource issue" to match the tone of the rest of the table.
```yaml
severity: warning
annotations:
  summary: "No events processed"
  description: "Adapter {{ $labels.component }} has not processed any events in 15 minutes."
```
**Priority: Bug**

The description on line 70 says "has not processed any events in 15 minutes", but the actual delay before this alert fires is ~30 minutes. `rate(...[15m])` takes ~15 min to go to zero after the last event, then `for: 15m` adds another 15 min on top. An on-call engineer seeing this alert would think events stopped 15 min ago when it was actually closer to 30 min — that matters for incident timelines.

A couple of options:

- Keep `for: 15m` and update the description to "...has not processed any events in ~30 minutes."
- Reduce `for` to something like 5m (just enough to avoid evaluation flapping) and update the description to "...in ~20 minutes." This gives faster detection while still being stable.

The second option also reduces the gap between this alert and the HyperFleetAdapterDown dead-man's-switch scenario — right now there's a 30 min blind spot where the adapter could be silently not processing.
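The delay arithmetic is simple enough to sanity-check; assuming the window and `for` duration as stated in the rule:

```shell
# Worst-case detection delay for a zero-rate alert:
# the rate() window must fully empty out, then the `for` clause must hold.
rate_window_min=15
for_duration_min=15
echo "worst-case detection delay: ~$((rate_window_min + for_duration_min)) min"
# prints: worst-case detection delay: ~30 min
```

With `for: 5m` the same sum drops to ~20 min, which is the faster-detection option above.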
```yaml
alert: HyperFleetAdapterDown
expr: hyperfleet_adapter_up == 0
```
**Priority: Bug**

`hyperfleet_adapter_up` is set to 0 only during graceful shutdown (confirmed in pkg/health/metrics.go). When a pod actually crashes, the metric goes stale in Prometheus — it doesn't become 0. So this alert won't fire on the most common "adapter down" scenario (OOM, panic, node failure, etc.).

The NoEventsProcessed alert covers this gap eventually, but with a ~30 min delay (see other comment). That's a long time for a critical-severity situation.

Suggestion — add an `absent()` clause to cover both graceful shutdown and crashes:

```yaml
expr: >
  hyperfleet_adapter_up == 0
  or
  absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})
```

Note: `absent()` will also fire if the metric has never been scraped (e.g., fresh Prometheus), so you might want to gate it with a condition or add a note about expected initial noise. Alternatively, if operators already have target-level `up == 0` alerts at the Prometheus layer, mention that in the doc so it's clear this alert only covers graceful shutdown.
Summary

- `docs/metrics.md`: Prometheus metrics reference with descriptions, labels, recommended alerting rules, and example PromQL queries
- `docs/runbook.md`: operational runbook for on-call engineers covering failure modes, troubleshooting steps, recovery procedures, and escalation paths
- `docs/observability.md`: add cross-reference link to the new `metrics.md`

Test plan

- `pkg/metrics/recorder.go` and `pkg/health/metrics.go`
- `cmd/adapter/main.go`