
HYPERFLEET-585 - doc: create reliability documentation (runbook and metrics)#68

Open
xueli181114 wants to merge 1 commit into openshift-hyperfleet:main from xueli181114:HYPERFLEET-585

Conversation


@xueli181114 (Contributor) commented Mar 5, 2026

Summary

  • Add docs/metrics.md: Prometheus metrics reference with descriptions, labels, recommended alerting rules, and example PromQL queries
  • Add docs/runbook.md: operational runbook for on-call engineers covering failure modes, troubleshooting steps, recovery procedures, and escalation paths
  • Update docs/observability.md: add cross-reference link to the new metrics.md

Test plan

  • Review metric names, types, labels, and bucket values against pkg/metrics/recorder.go and pkg/health/metrics.go
  • Review log patterns in runbook against source code
  • Verify PromQL alert expressions are syntactically correct
  • Verify health endpoint and port documentation matches cmd/adapter/main.go
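The PromQL syntax check above can be mechanized by keeping the alert rules in a standalone Prometheus rules file and running `promtool check rules` against it. A minimal sketch of such a file, using one of the documented alerts with the reviewers' aggregation fix applied (the file path, group name, `for` duration, and severity are illustrative, not taken from this PR):

```yaml
# deploy/prometheus/hyperfleet-adapter-alerts.yml (illustrative path)
# Validate with: promtool check rules deploy/prometheus/hyperfleet-adapter-alerts.yml
groups:
  - name: hyperfleet-adapter
    rules:
      - alert: HyperFleetAdapterNoEventsProcessed
        expr: |
          sum by (component, version) (
            rate(hyperfleet_adapter_events_processed_total[15m])
          ) == 0
          and on(component, version)
          hyperfleet_adapter_up == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Adapter reports up but has processed no events in 15 minutes."
```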

Summary by CodeRabbit

  • Documentation
    • Added comprehensive metrics guidance with recommended alerting rules, example PromQL queries, and noted health and metrics endpoints for monitoring.
    • Added a detailed operational runbook covering failure modes, symptom patterns, remediation steps, recovery procedures, verification steps, and escalation paths.
    • Updated observability guidance to reference the new monitoring, alerting, and runbook resources.


openshift-ci bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yingzhanredhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Mar 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fce82963-4c77-4304-8ad9-6988aa15e4f8

📥 Commits

Reviewing files that changed from the base of the PR and between 7bfce2d and 8ea83d8.

📒 Files selected for processing (3)
  • docs/metrics.md
  • docs/observability.md
  • docs/runbook.md

Walkthrough

Adds three documentation files for the hyperfleet-adapter: docs/metrics.md (detailed metrics, PromQL queries, recommended alerting rules, metrics served on port 9090, and health endpoints on port 8080 with /healthz and /readyz), docs/runbook.md (operational runbook covering failure modes, troubleshooting steps, recovery, verification commands, and escalation paths), and updates docs/observability.md to point to metrics.md for alerting and monitoring guidance.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • mbrudnoy
  • vkareh
🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and specifically describes the main change: creating reliability documentation comprising a runbook and metrics guide, with the ticket reference for traceability. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Around line 239-243: The "Errors by type" example uses
rate(hyperfleet_adapter_errors_total[5m]) which returns per-series rates but
does not group by the error_type label; update the example query to aggregate
rates grouped by error_type (e.g., wrap the rate(...) in a sum by
(error_type)(...) aggregation) so the output is explicitly grouped by the
error_type label as the heading describes.
- Around line 225-228: The success-rate query is wrongly keeping
status="success" on both numerator and denominator; change the denominator to
aggregate total traffic across statuses by removing the status selector and
summing the rate metric, e.g. use
rate(hyperfleet_adapter_events_processed_total{status="success"}[5m]) /
sum(rate(hyperfleet_adapter_events_processed_total[5m])) * 100 so the numerator
uses hyperfleet_adapter_events_processed_total{status="success"} and the
denominator uses sum(rate(hyperfleet_adapter_events_processed_total[5m])) to
compute percentage of total events.
- Around line 158-162: The alert misfires because
rate(hyperfleet_adapter_events_processed_total[15m]) is evaluated per status
label; aggregate out status first by applying sum by(component, version) to the
rate expression so the comparison to zero is done across all statuses, and
ensure the resulting vector uses the same labels as the join
(hyperfleet_adapter_up) — i.e., replace the raw rate(...) == 0 with sum
by(component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) ==
0 so the alert only fires when total events across statuses are zero.

In `@docs/runbook.md`:
- Line 276: The table cell containing the phrase "May be config or resource
issue." is missing a subject; update that cell to a complete sentence such as
"This may be a config or resource issue." by replacing the existing text
fragment exactly where the row reads "| **P3 - Medium** | Single replica
crash-looping; other replicas healthy | Investigate logs. May be config or
resource issue. |".
- Around line 201-204: The runbook line uses curl which doesn't exercise gRPC;
replace that invocation with grpcurl and call the standard health service to
validate the Maestro gRPC endpoint (use grpcurl -plaintext
maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check),
ensuring you reference grpcurl and the grpc.health.v1.Health/Check method
against maestro-grpc.maestro.svc.cluster.local:8090.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d58129d4-09f7-40fe-bfbb-7d6a35c3102b

📥 Commits

Reviewing files that changed from the base of the PR and between 7bfce2d and 428a0e4.

📒 Files selected for processing (3)
  • docs/metrics.md
  • docs/observability.md
  • docs/runbook.md


@coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
docs/metrics.md (3)

145-145: ⚠️ Potential issue | 🟡 Minor

"Errors by type" query should group by error_type.

The heading indicates grouping by error type, but the query returns per-series rates without explicit grouping. Add sum by (error_type) aggregation to match the heading.

♻️ Proposed fix
-rate(hyperfleet_adapter_errors_total[5m])
+sum by (error_type) (rate(hyperfleet_adapter_errors_total[5m]))

128-131: ⚠️ Potential issue | 🟠 Major

Success-rate query divides success by success, not success by total.

The denominator will match on status="success" from the numerator, so both sides evaluate to the same metric, yielding success/success instead of success/total. Aggregate the denominator across status labels.

🔧 Proposed fix
-rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
-/
-rate(hyperfleet_adapter_events_processed_total[5m])
-* 100
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
+)
+/
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total[5m])
+)
+* 100

61-64: ⚠️ Potential issue | 🟠 Major

No-events alert misfires due to status label fan-out.

The expression rate(hyperfleet_adapter_events_processed_total[15m]) == 0 is evaluated per status label, so it can alert even when events are flowing (e.g., when status="success" has events but status="failed" is zero). Aggregate away the status label before the comparison to zero.

🔧 Proposed fix
 expr: |
-  rate(hyperfleet_adapter_events_processed_total[15m]) == 0
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total[15m])
+  ) == 0
   and on(component, version)
   hyperfleet_adapter_up == 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Around line 44-48: The alert expression mistakenly divides
rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m]) by
rate(hyperfleet_adapter_events_processed_total[5m]) which performs
vector-matching on the status label and yields 1; change the denominator to
aggregate across status labels (e.g., use
sum(rate(hyperfleet_adapter_events_processed_total[5m])) ) so the expression
becomes failed-rate / total-rate and correctly computes failed/total.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 84aea913-06f4-4e98-a1b1-f9b8554ff763

📥 Commits

Reviewing files that changed from the base of the PR and between 428a0e4 and 4ea607b.

📒 Files selected for processing (1)
  • docs/metrics.md


@coderabbitai bot left a comment


♻️ Duplicate comments (6)
docs/runbook.md (2)

201-204: ⚠️ Potential issue | 🟠 Major

Line 201–204: use grpcurl for Maestro gRPC health validation instead of curl.

curl does not exercise gRPC health RPCs, so this check can be misleading for protocol-level validation.

Proposed fix
 2. Test gRPC connectivity from the adapter pod:
    ```bash
-   kubectl exec <pod> -- curl -s maestro-grpc.maestro.svc.cluster.local:8090
+   kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check
    ```

276-276: ⚠️ Potential issue | 🟡 Minor

Line 276: sentence fragment in escalation action cell.

Use a complete sentence for clarity in on-call guidance.

Proposed fix
-| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. May be config or resource issue. |
+| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |
docs/metrics.md (4)

142-146: ⚠️ Potential issue | 🟡 Minor

Line 142–146: “Errors by type” example should explicitly group by error_type.

Current query returns per-series rates but does not reflect the heading intent.

Proposed fix
-rate(hyperfleet_adapter_errors_total[5m])
+sum by (error_type) (rate(hyperfleet_adapter_errors_total[5m]))

44-48: ⚠️ Potential issue | 🔴 Critical

Line 44–48: failure-rate alert currently computes failed/failed, not failed/total.

Denominator is matched on status="failed" by default vector matching, so this can evaluate incorrectly as 1. Aggregate across status first.

Proposed fix
 expr: |
-  rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m])
-  /
-  rate(hyperfleet_adapter_events_processed_total[5m])
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m])
+  )
+  /
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total[5m])
+  )
   > 0.1

62-64: ⚠️ Potential issue | 🟠 Major

Line 62–64: no-events alert can misfire due to status label fan-out.

rate(...events_processed_total...) == 0 is evaluated per status series. Sum across statuses before comparing to zero.

Proposed fix
 expr: |
-  rate(hyperfleet_adapter_events_processed_total[15m]) == 0
+  sum by (component, version) (
+    rate(hyperfleet_adapter_events_processed_total[15m])
+  ) == 0
   and on(component, version)
   hyperfleet_adapter_up == 1

128-131: ⚠️ Potential issue | 🟠 Major

Line 128–131: success-rate query divides success by success instead of success by total.

Default matching keeps status="success" on both sides. Aggregate denominator over all statuses.

Proposed fix
-rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
-/
-rate(hyperfleet_adapter_events_processed_total[5m])
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total{status="success"}[5m])
+)
+/
+sum by (component, version) (
+  rate(hyperfleet_adapter_events_processed_total[5m])
+)
 * 100

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a18461ce-0e3c-4f23-8a95-6f91c15546ba

📥 Commits

Reviewing files that changed from the base of the PR and between 4ea607b and 69dc64d.

📒 Files selected for processing (3)
  • docs/metrics.md
  • docs/observability.md
  • docs/runbook.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/observability.md


@coderabbitai bot left a comment


♻️ Duplicate comments (2)
docs/runbook.md (2)

276-276: ⚠️ Potential issue | 🟡 Minor

Make the P3 action sentence complete.

Line 276 uses a fragment (“May be config or resource issue.”). Please make it a full sentence.

Proposed doc fix
-| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. May be config or resource issue. |
+| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |

201-204: ⚠️ Potential issue | 🟠 Major

Use grpcurl (not curl) to validate Maestro gRPC health.

Line 203 currently tests only TCP/HTTP reachability and does not validate a gRPC health RPC.

Proposed doc fix
 2. Test gRPC connectivity from the adapter pod:
    ```bash
-   kubectl exec <pod> -- curl -s maestro-grpc.maestro.svc.cluster.local:8090
+   kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check
    ```

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a1e79fd9-f3da-4553-8717-a440abab491e

📥 Commits

Reviewing files that changed from the base of the PR and between 69dc64d and e57c7b8.

📒 Files selected for processing (3)
  • docs/metrics.md
  • docs/observability.md
  • docs/runbook.md
✅ Files skipped from review due to trivial changes (1)
  • docs/observability.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/metrics.md


@coderabbitai bot left a comment


♻️ Duplicate comments (2)
docs/runbook.md (2)

201-204: ⚠️ Potential issue | 🟠 Major

Use grpcurl for gRPC endpoint validation, not curl.

The troubleshooting command uses curl to test the Maestro gRPC endpoint, but curl cannot natively invoke gRPC methods. It will fail to validate that the endpoint properly speaks the gRPC protocol.

Replace with grpcurl targeting the standard gRPC Health Checking Protocol to properly verify the Maestro gRPC endpoint is functional.

🔧 Proposed fix
 2. Test gRPC connectivity from the adapter pod:
    ```bash
-   kubectl exec <pod> -- curl -s maestro-grpc.maestro.svc.cluster.local:8090
+   kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check
    ```

276-276: ⚠️ Potential issue | 🟡 Minor

Fix incomplete sentence.

The phrase "May be config or resource issue." is missing a subject. Complete the sentence for grammatical correctness.

✏️ Proposed fix
-| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. May be config or resource issue. |
+| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |


📥 Commits

Reviewing files that changed from the base of the PR and between e57c7b8 and 538c55a.

📒 Files selected for processing (3)
  • docs/metrics.md
  • docs/observability.md
  • docs/runbook.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/metrics.md
  • docs/observability.md

@xueli181114
Contributor Author

@coderabbitai what is your latest review result?

@coderabbitai

coderabbitai bot commented Mar 6, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the latest review comments from PR `#68`
gh pr view 68 --repo openshift-hyperfleet/hyperfleet-adapter --json reviews,comments,reviewDecision

# Get review comments (inline comments on the diff)
gh api repos/openshift-hyperfleet/hyperfleet-adapter/pulls/68/reviews --jq '.[] | {id, user: .user.login, state, body, submitted_at}'

Length of output: 81687


@xueli181114 Here is a summary of my latest review results for this PR. Across multiple review runs, the following issues have been consistently flagged and remain unresolved:


docs/metrics.md

| Severity | Location | Issue |
|----------|----------|-------|
| 🔴 Critical | Lines 44–48 | `HyperFleetAdapterHighFailureRate` alert — denominator matches `status="failed"` via default vector matching, so the expression computes failed/failed (≈1) instead of failed/total. Fix: wrap both numerator and denominator in `sum by (component, version)(...)`. |
| 🟠 Major | Lines 61–64 | `HyperFleetAdapterNoEventsProcessed` alert — `rate(...events_processed_total[15m]) == 0` is evaluated per `status` series, causing false alerts. Fix: aggregate with `sum by (component, version)(rate(...))` before comparing to zero. |
| 🟠 Major | Lines 128–131 | Success-rate PromQL query — divides success/success instead of success/total because default vector matching keeps `status="success"` on both sides. Fix: wrap both sides in `sum by (component, version)(...)`. |
| 🟡 Minor | Lines 142–146 | "Errors by type" example query — missing `sum by (error_type)` aggregation, so results are not grouped by error type as the heading implies. |
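Putting the vector-matching fixes together, the corrected failure-rate rule might look like this (a sketch using the metric and label names cited in the findings above; the 10% threshold and time windows are illustrative):

```yaml
alert: HyperFleetAdapterHighFailureRate
expr: |
  sum by (component, version) (
    rate(hyperfleet_adapter_events_processed_total{status="failed"}[5m])
  )
  /
  sum by (component, version) (
    rate(hyperfleet_adapter_events_processed_total[5m])
  )
  > 0.10
for: 15m
labels:
  severity: critical
```

Aggregating both sides by the same labels drops the `status` label before the division, so the expression matches failed against total rather than PromQL's default label-for-label matching.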

docs/runbook.md

| Severity | Location | Issue |
|----------|----------|-------|
| 🟠 Major | Lines 201–204 | `curl` is used to test the Maestro gRPC endpoint, but curl cannot invoke gRPC methods. Replace with: `kubectl exec <pod> -- grpcurl -plaintext maestro-grpc.maestro.svc.cluster.local:8090 grpc.health.v1.Health/Check` |
| 🟡 Minor | Line 276 | Sentence fragment in the P3 escalation table: "May be config or resource issue." — add a subject, e.g. "This may be a config or resource issue." |

The most recent review run (2026-03-06T02:39:47Z, commit 538c55a0f) only re-checked docs/runbook.md (metrics.md was skipped as similar to previous changes). The PR is currently not approved. Would you like me to trigger a fresh full review once these issues are addressed?

…etrics)

Add operational documentation for on-call engineers and observability:

- docs/metrics.md: recommended alerting rules, thresholds, and PromQL
  queries (metric definitions live in observability.md)
- docs/runbook.md: failure modes, troubleshooting steps, recovery
  procedures, and escalation paths
- docs/observability.md: add cross-reference link to metrics.md
- Aggregate events rate by (component, version) in dead man's switch alert
- Group errors by error_type in example query
- Replace curl with grpcurl for Maestro gRPC health check
- Minor grammar fix in escalation table
@xueli181114
Contributor Author

@coderabbitai review 8ea83d8 again

@coderabbitai

coderabbitai bot commented Mar 6, 2026

@xueli181114 Sure! I'll kick off a fresh full review of commit 8ea83d8aa08d3d4442cb71e6f3991a47c89baa80 on PR #68 right away.

✅ Actions performed

Full review triggered.

@xueli181114
Contributor Author

@xueli181114 Sure! I'll kick off a fresh full review of commit 8ea83d8aa08d3d4442cb71e6f3991a47c89baa80 on PR #68 right away.
✅ Actions performed

Is your review finished? It has already taken 10 minutes.

@xueli181114
Contributor Author

@xueli181114 Sure! I'll kick off a fresh full review of commit 8ea83d8aa08d3d4442cb71e6f3991a47c89baa80 on PR #68 right away.
✅ Actions performed

@coderabbitai what is your new review result? Please post it.

@coderabbitai

coderabbitai bot commented Mar 6, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the latest reviews from PR `#68`
gh api repos/openshift-hyperfleet/hyperfleet-adapter/pulls/68/reviews \
  --jq '.[] | {id, user: .user.login, state, submitted_at, body}' | tail -200

echo "---INLINE REVIEW COMMENTS---"
# Fetch inline review comments (pull request review comments)
gh api repos/openshift-hyperfleet/hyperfleet-adapter/pulls/68/comments \
  --jq '.[] | {id, user: .user.login, path, line, body, created_at}' | tail -300

Length of output: 51937


@xueli181114 Here are the latest review results based on all review runs available. Good news — all previously flagged issues have been resolved:


docs/metrics.md — All Issues Addressed

| Severity | Location | Issue | Status |
|----------|----------|-------|--------|
| 🔴 Critical | Lines 44–48 | `HyperFleetAdapterHighFailureRate`: failed/failed vector matching | ✅ Fixed in 538c55a |
| 🟠 Major | Lines 62–64 | `HyperFleetAdapterNoEventsProcessed`: per-status fan-out | ✅ Fixed in 73cc8df |
| 🟠 Major | Lines 225–228 | Success-rate PromQL: dividing success/success instead of success/total | ✅ Fixed in 538c55a |
| 🟡 Minor | Line 146 | "Errors by type": missing `sum by (error_type)` aggregation | ✅ Fixed in 73cc8df |

docs/runbook.md — All Issues Addressed

| Severity | Location | Issue | Status |
|----------|----------|-------|--------|
| 🟠 Major | Lines 201–204 | `curl` used to test gRPC Maestro endpoint — replaced with `grpcurl` | ✅ Fixed in 73cc8df |
| 🟡 Minor | Line 276 | Sentence fragment in P3 escalation table | ✅ Fixed in 73cc8df |

The two most recent review runs (2026-03-06T02:28:52Z and 2026-03-06T02:39:47Z) produced no new actionable comments — only duplicates of previously resolved items. The PR looks clean. Would you like me to trigger a final review or approve the PR?

Contributor

@ciaranRoche ciaranRoche left a comment

Nice work on this — having proper observability docs and a runbook is a big step for operational readiness. I went through all three files and cross-referenced against the codebase. Everything is very accurate (every log pattern, metric name, port, and endpoint checks out). Left a few inline suggestions below, nothing major.

@@ -89,3 +89,7 @@ The adapter automatically registers Prometheus metrics from the [hyperfleet-brok
| `hyperfleet_broker_message_duration_seconds` | Histogram | Message processing duration |
Contributor

The adapter metrics tables above have a Labels column but the broker table doesn't — might be worth adding one for consistency. In particular, messages_consumed_total and message_duration_seconds both have a topic label (you can see it from RecordConsumed(topic) and RecordDuration(topic, ...) in the broker library), but that's not mentioned here. Only errors_total calls out its labels inline.

Not a big deal, but it'd help someone writing PromQL queries to know about topic upfront.
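For example, documenting the label would make per-topic queries like this one obvious (a sketch, assuming the `topic` label described above):

```promql
sum by (topic) (rate(hyperfleet_broker_messages_consumed_total[5m]))
```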

Contributor

@rh-amarin rh-amarin Mar 10, 2026

There should be a 1-1 relationship between:

  • adapter (component name)
  • subscription_id used by adapter
  • topic associated with subscription_id

So, this means that the info on the topic may be redundant.

But having it in the metrics may speed up troubleshooting.

It could also help to spot metrics from an adapter on an unexpected topic, which may reveal a configuration error.

```yaml
alert: HyperFleetAdapterNoEventsProcessed
expr: |
  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
```
Contributor

This alert has a subtle edge case — if the adapter has never processed any events (fresh deploy, or topic is empty), the counter doesn't exist yet, so rate() returns nothing, sum() returns nothing, and == 0 has nothing to evaluate against. The alert silently won't fire, which is exactly the scenario you'd want it to catch.

You could add an absent() fallback to handle this:

expr: |
  (
    sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
    or
    absent(hyperfleet_adapter_events_processed_total) == 1
  )
  and on(component, version)
  hyperfleet_adapter_up == 1

Or just add a note that this only fires once the first event has been processed — either way is fine.

expr: |
  histogram_quantile(0.95,
    rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
  ) > 60
Contributor

Worth thinking about what happens when there are multiple adapter replicas. Right now histogram_quantile operates on the raw per-instance buckets. If two replicas scrape into the same Prometheus, you'd want to aggregate the buckets first:

histogram_quantile(0.95,
  sum by (component, version, le) (
    rate(hyperfleet_adapter_event_processing_duration_seconds_bucket[5m])
  )
) > 60

Not urgent if we're single-instance today, but worth a note or a TODO so it doesn't bite us later.


Events are ACKed on failure and not automatically retried. To reprocess:
1. Identify the failed event from logs (look for `event_id`)
2. Republish the event to the broker topic
Contributor

This step is a bit hand-wavy for someone debugging at 3am — they probably need a concrete command rather than "republish the event to the broker topic". Maybe add examples per broker type?

For Pub/Sub:

gcloud pubsub topics publish <topic> --message '<event-json>'

For RabbitMQ:

rabbitmqadmin publish exchange=<exchange> routing_key=<key> payload='<event-json>'

Even a placeholder would help point people in the right direction.

Contributor

We can point to the async API contract, which contains the templates for messages:
https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/broker/asyncapi.yaml

kubectl exec <pod> -- curl -s localhost:8080/readyz | jq .

# Check metrics
kubectl exec <pod> -- curl -s localhost:9090/metrics | grep hyperfleet_adapter_up
Contributor

The recovery verification checks that the pod is up and ready, but doesn't confirm events are actually flowing again. Might be worth adding a quick metrics check:

# Verify events are being processed successfully
kubectl exec <pod> -- curl -s localhost:9090/metrics | grep 'hyperfleet_adapter_events_processed_total'

That way you know the adapter isn't just alive but actually doing work.

Contributor

As "doing work" may depend on messages being present on the queues, we could also confirm there are messages in the specific broker (e.g. Pub/Sub).

| `"context cancelled while waiting for resource deletion"` | Recreate timed out waiting for deletion | Resource may have finalizers preventing deletion |

Retryable K8s errors (automatically retried): timeouts, server unavailable, internal errors, rate limiting.
Non-retryable: forbidden, unauthorized, bad request, invalid, gone.
Contributor

Small gap here — the code in pkg/errors/k8s_error.go also treats network errors (connection refused, reset, etc.) as retryable, and "method not supported" / "not acceptable" as non-retryable. Might be worth adding those for completeness since they can show up in practice.

|----------|-----------|--------|
| **P1 - Critical** | Adapter down across all replicas; no events processing | Page on-call SRE. Check broker and API dependencies. |
| **P2 - High** | High failure rate (>10% events failing for 15+ min) | Notify team. Investigate error_type in metrics. |
| **P3 - Medium** | Single replica crash-looping; other replicas healthy | Investigate logs. This may be a config or resource issue. |
Contributor

Tiny grammar nit: "May be config or resource issue" reads a bit choppy — maybe "This may be a config or resource issue" to match the tone of the rest of the table.

severity: warning
annotations:
  summary: "No events processed"
  description: "Adapter {{ $labels.component }} has not processed any events in 15 minutes."
Contributor

Priority: Bug

The description on line 70 says "has not processed any events in 15 minutes"
but the actual delay before this alert fires is ~30 minutes.

rate(...[15m]) takes ~15 min to go to zero after the last event, then for: 15m
adds another 15 min on top. An on-call engineer seeing this alert would think
events stopped 15 min ago when it was actually closer to 30 min — that
matters for incident timelines.

A couple of options:

  1. Keep for: 15m and update the description to "...has not processed any
    events in ~30 minutes."
  2. Reduce for to something like 5m (just enough to avoid evaluation flapping)
    and update the description to "...in ~20 minutes." — this gives faster
    detection while still being stable.

The second option also reduces the gap between this alert and the
HyperFleetAdapterDown dead-man's-switch scenario — right now there's a 30 min
blind spot where the adapter could be silently not processing.
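Option 2 might look like the following (a sketch; the `for` duration and description wording are illustrative):

```yaml
alert: HyperFleetAdapterNoEventsProcessed
expr: |
  sum by (component, version) (rate(hyperfleet_adapter_events_processed_total[15m])) == 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "No events processed"
  description: "Adapter {{ $labels.component }} has not processed any events in ~20 minutes."
```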


```yaml
alert: HyperFleetAdapterDown
expr: hyperfleet_adapter_up == 0
```
Contributor

Priority: Bug

hyperfleet_adapter_up is set to 0 only during graceful shutdown (confirmed in
pkg/health/metrics.go). When a pod actually crashes, the metric goes stale in
Prometheus — it doesn't become 0. So this alert won't fire on the most common
"adapter down" scenario (OOM, panic, node failure, etc.).

The NoEventsProcessed alert covers this gap eventually, but with a ~30 min
delay (see other comment). That's a long time for a critical-severity
situation.

Suggestion — add an absent() clause to cover both graceful shutdown and
crashes:

  expr: >
    hyperfleet_adapter_up == 0
    or
    absent(hyperfleet_adapter_up{component="hyperfleet-adapter"})

Note: absent() will also fire if the metric has never been scraped (e.g.,
fresh Prometheus), so you might want to gate it with a condition or add a note
about expected initial noise. Alternatively, if operators already have
target-level up == 0 alerts at the Prometheus layer, mention that in the doc
so it's clear this alert only covers graceful shutdown.
