Hyperfleet-542: Add OpenTelemetry instrumentation to Sentinel #51

tirthct wants to merge 31 commits into openshift-hyperfleet:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files.
Walkthrough

Adds OpenTelemetry tracing and observability across the service. Introduces pkg/telemetry with InitTraceProvider, Shutdown, StartSpan, and SetTraceContext; wires TracerProvider init/shutdown in cmd/sentinel driven by TRACING_ENABLED/OTEL_SERVICE_NAME; instruments the HTTP client with otelhttp; records sentinel.poll, sentinel.evaluate, and per-publish spans with attributes, errors, and CloudEvent traceparent attachment; extends LogConfig with OTel and SentinelConfig with MessagingSystem; and adds unit and integration tests validating tracer init, spans, propagation, and hierarchy.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant TP as TracerProvider
    participant Exporter as SpanExporter
    participant Sentinel as Sentinel
    participant Evaluator as Evaluator
    participant Publisher as Publisher
    participant Broker as Broker
    App->>TP: InitTraceProvider(serviceName)
    TP->>Exporter: configure exporter & sampler
    TP-->>App: return TracerProvider
    loop Poll cycle
        Sentinel->>Sentinel: Start "sentinel.poll" span
        Sentinel->>Evaluator: iterate resources
        Evaluator->>Evaluator: Start "sentinel.evaluate" span (resource attrs)
        Evaluator->>Publisher: create publish child span
        Publisher->>Broker: Publish(event, publishCtx)
        Publisher->>Exporter: Export spans
        Publisher-->>Evaluator: publish result
        Evaluator->>Evaluator: End "sentinel.evaluate" span
        Sentinel->>Sentinel: End "sentinel.poll" span
    end
    App->>TP: Shutdown(ctx)
    TP->>Exporter: Flush & shutdown
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/sentinel/sentinel.go (1)
113-187: ⚠️ Potential issue | 🔴 Critical

Span leak: `evalSpan.End()` and `publishSpan.End()` are skipped on `continue` paths.

When `event.SetData` fails (Line 141) or `Publish` fails (Line 164), the `continue` statements skip `evalSpan.End()` at Line 187 and `publishSpan.End()` at Line 167 respectively. Leaked spans will accumulate in memory and produce incomplete traces. The safest fix is to end spans via `defer` or explicit `End()` calls before each `continue`.

Proposed fix

```diff
 // span: sentinel.evaluate
 evalCtx, evalSpan := telemetry.StartSpan(ctx, "sentinel.evaluate",
 	attribute.String("hyperfleet.resource_type", resource.Kind),
 	attribute.String("hyperfleet.resource_id", resource.ID),
 )
 decision := s.decisionEngine.Evaluate(resource, now)
 if decision.ShouldPublish {
 	pending++
 	// Add decision reason to context for structured logging
 	eventCtx := logger.WithDecisionReason(evalCtx, decision.Reason)
 	// Create CloudEvent
 	event := cloudevents.NewEvent()
 	event.SetSpecVersion(cloudevents.VersionV1)
 	event.SetType(fmt.Sprintf("com.redhat.hyperfleet.%s.reconcile", resource.Kind))
 	event.SetSource("hyperfleet-sentinel")
 	event.SetID(uuid.New().String())
 	if err := event.SetData(cloudevents.ApplicationJSON, map[string]interface{}{
 		"kind":       resource.Kind,
 		"id":         resource.ID,
 		"generation": resource.Generation,
 		"href":       resource.Href,
 		"reason":     decision.Reason,
 	}); err != nil {
 		s.logger.Errorf(eventCtx, "Failed to set event data resource_id=%s error=%v", resource.ID, err)
+		evalSpan.End()
 		continue
 	}
 	// span: publish (child of sentinel.evaluate)
 	publishCtx, publishSpan := telemetry.StartSpan(eventCtx, fmt.Sprintf("%s publish", topic),
 		attribute.String("hyperfleet.resource_type", resource.Kind),
 		attribute.String("hyperfleet.resource_id", resource.ID),
 		attribute.String("hyperfleet.decision_reason", decision.Reason),
 		attribute.String("messaging.system", "gcp_pubsub"),
 		attribute.String("messaging.operation.type", "publish"),
 		attribute.String("messaging.destination.name", topic),
 		attribute.String("messaging.message.id", event.ID()),
 	)
 	if publishSpan.SpanContext().IsValid() {
 		telemetry.SetTraceContext(&event, publishSpan)
 	}
 	// Publish to broker using configured topic
 	if err := s.publisher.Publish(publishCtx, topic, &event); err != nil {
 		// Record broker error
 		metrics.UpdateBrokerErrorsMetric(resourceType, resourceSelector, "publish_error")
 		s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
+		publishSpan.End()
+		evalSpan.End()
 		continue
 	}
 	publishSpan.End()
```

Alternatively, a cleaner approach would be to use a closure or restructure to use defer-like patterns within the loop body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel.go` around lines 113 - 187, The spans started by telemetry.StartSpan (evalSpan and publishSpan) are leaked on early continue paths (failures in event.SetData and s.publisher.Publish); fix by ensuring each span is ended on all control paths—either call defer evalSpan.End() immediately after creating evalSpan and defer publishSpan.End() right after creating publishSpan (so they run even on continue), or explicitly call publishSpan.End() and evalSpan.End() just before every continue in the error branches (references: telemetry.StartSpan, evalSpan, publishSpan, event.SetData, s.publisher.Publish).
🧹 Nitpick comments (4)
internal/sentinel/sentinel_test.go (1)
444-448: Global tracer provider is set but never restored after test.

`otel.SetTracerProvider(tp)` modifies the global state. While tests in the same package run sequentially by default, this could affect subsequent tests that rely on the global provider state (or lack thereof). Consider saving and restoring the previous provider.

Suggested pattern

```diff
 otel.SetTracerProvider(tp)
+// Optionally restore previous provider on cleanup
+// prevTP := otel.GetTracerProvider()
+// defer otel.SetTracerProvider(prevTP)
```

This is optional since tests currently run sequentially and the deferred `tp.Shutdown` will clean up the provider.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel_test.go` around lines 444 - 448, The test sets a global provider via trace.NewTracerProvider(...) and otel.SetTracerProvider(tp) but doesn't restore the previous global provider; capture the current provider with otel.GetTracerProvider() before calling otel.SetTracerProvider(tp), and defer otel.SetTracerProvider(previousProvider) so global state is restored; keep the existing deferred tp.Shutdown() to shut down the new provider.

internal/client/client_test.go (1)
796-797: Ignored `w.Write` error return value.

Line 797 ignores the error from `w.Write(...)`, unlike the rest of the test file which consistently handles write errors.

Suggested fix

```diff
-	w.Write([]byte(`{"error": "internal server error"}`))
+	if _, err := w.Write([]byte(`{"error": "internal server error"}`)); err != nil {
+		t.Logf("Error writing response: %v", err)
+	}
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/client/client_test.go` around lines 796 - 797, The test HTTP handler ignores the error returned from w.Write(...) at the end of the handler; update the handler to check the write error and fail the test if it occurs. Specifically, replace the bare w.Write(...) call following w.WriteHeader(http.StatusInternalServerError) with a checked write (capture the return error from w.Write and call t.Fatalf or t.Errorf with the error), using the same testing.T variable used elsewhere in the file so failures are reported consistently.

test/integration/integration_test.go (1)
459-464: `InitTraceProvider` with stdout exporter will write span output to stdout during test runs.

Since no `OTEL_EXPORTER_OTLP_ENDPOINT` env var is set in the test, `InitTraceProvider` falls back to the stdout exporter, which will print formatted span data to stdout during test execution. This is noisy but functional. If this is intentional for debugging, it's fine; otherwise consider using an in-memory exporter as done in `TestIntegration_EndToEndSpanHierarchy`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/integration/integration_test.go` around lines 459 - 464, The test currently calls telemetry.InitTraceProvider(ctx, "sentinel", "test", 1.0) which falls back to the stdout exporter and prints spans to stdout; change the test to use the in-memory exporter instead of stdout (or set OTEL_EXPORTER_OTLP_ENDPOINT) by invoking the InitTraceProvider variant or option the project uses for in-memory tracing (match the pattern used in TestIntegration_EndToEndSpanHierarchy), and keep the deferred telemetry.Shutdown(tp) call; update the call site in integration_test.go to use that in-memory configuration so spans are not emitted to stdout during test runs.

pkg/logger/logger.go (1)
46-50: Consider validating `SamplingRate` range.

`SamplingRate` is a `float64` with no validation at the config level. While `InitTraceProvider` in `pkg/telemetry/otel.go` handles edge cases (≥1.0 → AlwaysSample, ≤0.0 → NeverSample), a negative or >1.0 value could be confusing to operators. Consider documenting the valid range or adding a validation method on `OTelConfig`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/logger/logger.go` around lines 46 - 50, Add explicit validation for OTelConfig.SamplingRate by implementing a validation method (e.g., func (c *OTelConfig) Validate() error) that ensures SamplingRate is within the inclusive range 0.0–1.0 and returns a clear error if out of range; reference the OTelConfig struct and call this Validate method where configuration is loaded or before InitTraceProvider is invoked so invalid values error early (include a helpful message indicating the valid range).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 257-262: The telemetry shutdown is using the same shutdownCtx as
metricsServer.Shutdown and the log call uses the already-cancelled ctx; create a
separate context with its own timeout for telemetry (e.g., teleCtx via
context.WithTimeout) and call telemetry.Shutdown(teleCtx, tp) so it has its own
budget independent of metricsServer.Shutdown, ensure you cancel the teleCtx
after use, and change the error log invocation (log.Extra(...).Error) to use the
telemetry context (teleCtx) or the appropriate non-cancelled shutdown context so
the log is not emitted with a canceled ctx; reference telemetry.Shutdown,
metricsServer.Shutdown, shutdownCtx, tp, and log.Extra to locate and update the
code.
- Around line 153-169: The OTel settings never get set because initLogging uses
DefaultConfig() which leaves LogConfig.OTel.Enabled false; update initLogging to
populate logCfg.OTel.Enabled and logCfg.OTel.SamplingRate from configuration
sources (either the loaded SentinelConfig/YAML or environment variables) so the
check in main.go that inspects logCfg.OTel.Enabled can actually enable tracing;
specifically, within initLogging (and where config.LoadConfig/SentinelConfig is
processed) read an OTEL_ENABLED (boolean) env var or map the YAML fields into
LogConfig.OTel (including SamplingRate) and assign them to logCfg before main.go
calls telemetry.InitTraceProvider (referencing symbols: initLogging, LogConfig,
DefaultConfig, logCfg.OTel.Enabled, logCfg.OTel.SamplingRate,
config.LoadConfig).
In `@internal/sentinel/sentinel.go`:
- Around line 149-150: The attribute.String("messaging.system", "gcp_pubsub") is
hardcoded but should come from configuration; change the code that builds the
span/attributes (the attribute.String calls around
messaging.operation.type/publish in sentinel.go) to read the messaging system
from a configuration value or environment variable (e.g., cfg.MessagingSystem or
os.Getenv("MESSAGING_SYSTEM")) and use that value when constructing
attribute.String("messaging.system", <configuredValue>), with a sensible default
(e.g., "rabbitmq") if unset.
In `@pkg/telemetry/otel_test.go`:
- Around line 308-326: TestSetTraceContext_InvalidSpan currently ends a real
span which doesn't make SpanContext invalid; replace creation of the span with a
no-op/non-recording span (e.g., obtain span via
trace.SpanFromContext(context.Background())) so span.SpanContext().IsValid() is
false and the !IsValid() branch in SetTraceContext is exercised, then call
SetTraceContext(&event, span) and assert that no "traceparent" extension was
added to event.Extensions(); update imports accordingly to reference
go.opentelemetry.io/otel/trace if needed.
- Around line 42-77: TestInitTraceProvider_OTLPExporter currently points
OTEL_EXPORTER_OTLP_ENDPOINT at a non-existent collector and then calls Shutdown
which blocks/fails; fix by making the test use a local httptest server that
responds 200 to POST /v1/traces and set OTEL_EXPORTER_OTLP_ENDPOINT to that
server's URL (server.URL + "/v1/traces"), so InitTraceProvider, tracer use, and
the deferred Shutdown call complete reliably; update the test to start
httptest.NewServer before setting the env var and close the server at the end,
referencing TestInitTraceProvider_OTLPExporter, InitTraceProvider, and Shutdown
in your changes.
- Around line 65-70: Deferred cleanup closures calling Shutdown currently use
t.Fatal which is unsafe inside defers; change those deferred anonymous functions
(the ones invoking Shutdown(ctx, tp) and similar cleanup calls) to call t.Errorf
with the error when Shutdown returns non-nil instead of t.Fatal/t.Fatalf, so the
test reports the cleanup failure without invoking Fatal after the test body has
returned. Locate the three other deferred closures in this file that call
Shutdown or similar cleanup (same pattern as the closure around Shutdown at
lines noted in the review) and replace their t.Fatal/t.Fatalf calls with
t.Errorf while keeping the rest of the defer logic unchanged.
In `@pkg/telemetry/otel.go`:
- Line 117: Fix the typo in the doc comment for StartSpan: change "conext" to
"context" in the comment that reads "// StartSpan starts a span and enriches
conext with trace/span IDs for logging" so it correctly reads "// StartSpan
starts a span and enriches context with trace/span IDs for logging"; locate the
comment above the StartSpan function in otel.go and update the word only.
- Around line 22-29: The listed package-level string literals
(SAMPLER_ALWAYS_ON, SAMPLER_ALWAYS_OFF, SAMPLER_TRACE_ID_RATIO,
ENV_OTEL_TRACES_SAMPLER, ENV_OTEL_TRACES_SAMPLER_ARG,
ENV_OTEL_EXPORTER_OTLP_ENDPOINT) should be declared as constants instead of
variables; replace the var (...) block in pkg/telemetry/otel.go with a const
(...) block so these identifiers are immutable and cannot be reassigned
elsewhere in the package.
- Around line 139-145: The traceparent being set in SetTraceContext incorrectly
hardcodes the trace-flags byte to "01"; update SetTraceContext to read the
actual flags from span.SpanContext().TraceFlags() and format that byte as two
hex characters when building traceParent (keep the existing TraceID and SpanID
usage and the IsValid() guard). Use the span context's TraceFlags value
(converted to a uint8 and formatted with "%02x") instead of the fixed "-01" so
the traceparent accurately reflects whether the span was sampled.
- Around line 41-46: The code incorrectly passes a full URL from
ENV_OTEL_EXPORTER_OTLP_ENDPOINT into otlptracehttp.WithEndpoint (which expects
host:port), causing mis-parsing; fix by either removing the manual env var
handling and simply calling otlptracehttp.New(ctx) so the SDK reads
OTEL_EXPORTER_OTLP_ENDPOINT itself, or if you must pass it explicitly, replace
otlptracehttp.WithEndpoint(otlpEndpoint) with
otlptracehttp.WithEndpointURL(parsedURL) (use the URL form) and ensure
otlpEndpoint is a properly parsed *url.URL; update error logging in the
otlptracehttp.New/creation branch (log.Errorf) accordingly and remove the
redundant env read of ENV_OTEL_EXPORTER_OTLP_ENDPOINT.
In `@test/integration/integration_test.go`:
- Around line 816-817: The loop using pollSpans[:2] can panic if fewer than 2
spans were captured; update the iteration to safely bound the slice by computing
n := len(pollSpans) and if n > 2 set n = 2, then iterate over pollSpans[:n] (or
otherwise iterate up to min(2, len(pollSpans))). Apply this change around the
existing pollSpans variable and the for _, pollSpan := range pollSpans[:2] loop
so the test never slices beyond the available elements.
---
Outside diff comments:
In `@internal/sentinel/sentinel.go`:
- Around line 113-187: The spans started by telemetry.StartSpan (evalSpan and
publishSpan) are leaked on early continue paths (failures in event.SetData and
s.publisher.Publish); fix by ensuring each span is ended on all control
paths—either call defer evalSpan.End() immediately after creating evalSpan and
defer publishSpan.End() right after creating publishSpan (so they run even on
continue), or explicitly call publishSpan.End() and evalSpan.End() just before
every continue in the error branches (references: telemetry.StartSpan, evalSpan,
publishSpan, event.SetData, s.publisher.Publish).
---
Nitpick comments:
In `@internal/client/client_test.go`:
- Around line 796-797: The test HTTP handler ignores the error returned from
w.Write(...) at the end of the handler; update the handler to check the write
error and fail the test if it occurs. Specifically, replace the bare
w.Write(...) call following w.WriteHeader(http.StatusInternalServerError) with a
checked write (capture the return error from w.Write and call t.Fatalf or
t.Errorf with the error), using the same testing.T variable used elsewhere in
the file so failures are reported consistently.
In `@internal/sentinel/sentinel_test.go`:
- Around line 444-448: The test sets a global provider via
trace.NewTracerProvider(...) and otel.SetTracerProvider(tp) but doesn't restore
the previous global provider; capture the current provider with
otel.GetTracerProvider() before calling otel.SetTracerProvider(tp), and defer
otel.SetTracerProvider(previousProvider) so global state is restored; keep the
existing deferred tp.Shutdown() to shut down the new provider.
In `@pkg/logger/logger.go`:
- Around line 46-50: Add explicit validation for OTelConfig.SamplingRate by
implementing a validation method (e.g., func (c *OTelConfig) Validate() error)
that ensures SamplingRate is within the inclusive range 0.0–1.0 and returns a
clear error if out of range; reference the OTelConfig struct and call this
Validate method where configuration is loaded or before InitTraceProvider is
invoked so invalid values error early (include a helpful message indicating the
valid range).
In `@test/integration/integration_test.go`:
- Around line 459-464: The test currently calls telemetry.InitTraceProvider(ctx,
"sentinel", "test", 1.0) which falls back to the stdout exporter and prints
spans to stdout; change the test to use the in-memory exporter instead of stdout
(or set OTEL_EXPORTER_OTLP_ENDPOINT) by invoking the InitTraceProvider variant
or option the project uses for in-memory tracing (match the pattern used in
TestIntegration_EndToEndSpanHierarchy), and keep the deferred
telemetry.Shutdown(tp) call; update the call site in integration_test.go to use
that in-memory configuration so spans are not emitted to stdout during test
runs.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
cmd/sentinel/main.go, go.mod, internal/client/client.go, internal/client/client_test.go, internal/sentinel/sentinel.go, internal/sentinel/sentinel_test.go, pkg/logger/logger.go, pkg/telemetry/otel.go, pkg/telemetry/otel_test.go, test/integration/integration_test.go
♻️ Duplicate comments (2)
cmd/sentinel/main.go (2)
160-176: ⚠️ Potential issue | 🔴 Critical

`logCfg.OTel.Enabled` is always `false`: OTel initialization is unreachable dead code.

`initLogging` constructs `LogConfig` via `logger.DefaultConfig()` (which sets `OTel.Enabled: false`) and only updates `Level`, `Format`, and `Output`. Neither `OTEL_ENABLED` nor `OTEL_SAMPLING_RATE` are ever read and applied to `logCfg.OTel`. As a result, the branch at Line 166 is never taken and tracing is never activated.

This was flagged in a previous review; the fix (reading `OTEL_ENABLED`/`OTEL_SAMPLING_RATE` from env vars in `initLogging`) has not been applied yet.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 160 - 176, The OTel branch is never reached because initLogging uses logger.DefaultConfig() which leaves logCfg.OTel.Enabled false and never reads OTEL_ENABLED/OTEL_SAMPLING_RATE; update initLogging to read the OTEL_ENABLED and OTEL_SAMPLING_RATE environment variables and apply them to the returned LogConfig (i.e., set logCfg.OTel.Enabled based on OTEL_ENABLED parsing and set logCfg.OTel.SamplingRate from OTEL_SAMPLING_RATE parsing), ensuring any parsing errors are handled sensibly so the downstream check of logCfg.OTel.Enabled in main.go will correctly enable telemetry; reference initLogging, logger.DefaultConfig(), LogConfig and logCfg.OTel.Enabled/SamplingRate in your changes.
290-295: ⚠️ Potential issue | 🟠 Major

OTel shutdown reuses the shared `shutdownCtx`, and the error log uses an already-cancelled `ctx`.

Two pre-existing issues remain unaddressed:
- `telemetry.Shutdown` shares the same `shutdownCtx` (20 s) already partially consumed by `healthServer.Shutdown` + `metricsServer.Shutdown`. If those are slow, the OTel flush budget is eroded.
- Line 293 passes the root `ctx` (cancelled at Line 279) to `log.Extra(...).Error(...)`, so the log call is issued with a cancelled context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 290 - 295, The OTel shutdown call reuses the shared shutdownCtx and logs using the already-cancelled root ctx; change telemetry.Shutdown(shutdownCtx, tp) to use its own fresh context and ensure the log call uses a live context: create a new context with timeout (e.g., otelCtx, otelCancel := context.WithTimeout(context.Background(), 20*time.Second)) and defer otelCancel(), call telemetry.Shutdown(otelCtx, tp), and pass a non-cancelled context (e.g., otelCtx or context.Background()) to log.Extra(...).Error(...) so the shutdown has its own budget and the error is logged with an active context.
🧹 Nitpick comments (1)
cmd/sentinel/main.go (1)
30-33: `serviceName` is a runtime-configurable value mixed into the build-info `var` block.

`version`, `commit`, and `date` are set by `ldflags` at build time and are effectively constants, while `serviceName` is mutated at runtime inside `runServe` (Line 162). Keeping them in the same block is misleading. Consider making `serviceName` a local variable in `runServe` or, if it needs to be the default value, a separate `const`.

♻️ Suggested refactor

```diff
 var (
 	version = "dev"
 	commit  = "none"
 	date    = "unknown"
-	serviceName = "sentinel"
 )
```

Then in `runServe`:

```diff
+	serviceName := "sentinel"
 	if envServiceName := os.Getenv("OTEL_SERVICE_NAME"); envServiceName != "" {
 		serviceName = envServiceName
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 30 - 33, The serviceName variable should not live in the same build-info var block as the ldflags-populated constants (version, commit, date); move serviceName out of that var block and make it either a local variable inside runServe (referencing runServe) if it is mutated at runtime, or make it a separate const/var with a default value distinct from version/commit/date if you need a package-level default. Update any references to serviceName accordingly (e.g., where runServe mutates it) so you no longer mix runtime-mutable state with the ldflags-injected build constants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cmd/sentinel/main.go`:
- Around line 160-176: The OTel branch is never reached because initLogging uses
logger.DefaultConfig() which leaves logCfg.OTel.Enabled false and never reads
OTEL_ENABLED/OTEL_SAMPLING_RATE; update initLogging to read the OTEL_ENABLED and
OTEL_SAMPLING_RATE environment variables and apply them to the returned
LogConfig (i.e., set logCfg.OTel.Enabled based on OTEL_ENABLED parsing and set
logCfg.OTel.SamplingRate from OTEL_SAMPLING_RATE parsing), ensuring any parsing
errors are handled sensibly so the downstream check of logCfg.OTel.Enabled in
main.go will correctly enable telemetry; reference initLogging,
logger.DefaultConfig(), LogConfig and logCfg.OTel.Enabled/SamplingRate in your
changes.
- Around line 290-295: The OTel shutdown call reuses the shared shutdownCtx and
logs using the already-cancelled root ctx; change
telemetry.Shutdown(shutdownCtx, tp) to use its own fresh context and ensure the
log call uses a live context: create a new context with timeout (e.g., otelCtx,
otelCancel := context.WithTimeout(context.Background(), 20*time.Second)) and
defer otelCancel(), call telemetry.Shutdown(otelCtx, tp), and pass a
non-cancelled context (e.g., otelCtx or context.Background()) to
log.Extra(...).Error(...) so the shutdown has its own budget and the error is
logged with an active context.
---
Nitpick comments:
In `@cmd/sentinel/main.go`:
- Around line 30-33: The serviceName variable should not live in the same
build-info var block as the ldflags-populated constants (version, commit, date);
move serviceName out of that var block and make it either a local variable
inside runServe (referencing runServe) if it is mutated at runtime, or make it a
separate const/var with a default value distinct from version/commit/date if you
need a package-level default. Update any references to serviceName accordingly
(e.g., where runServe mutates it) so you no longer mix runtime-mutable state
with the ldflags-injected build constants.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
cmd/sentinel/main.go, go.mod
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/client/client_test.go`:
- Line 804: The test currently ignores the error returned by
NewHyperFleetClient; update the test to capture both return values (client, err
:= NewHyperFleetClient(...)) and assert/fail if err != nil (e.g., t.Fatalf or
require.NoError) before using client so any constructor failures surface
immediately; ensure you also handle the client variable in subsequent
assertions.
- Around line 795-800: In the httptest server handler (the anonymous
http.HandlerFunc passed to httptest.NewServer) replace the unsafe t.Fatalf call
with t.Errorf and then return to avoid invoking FailNow from a handler
goroutine; specifically change the error handling after w.Write to call
t.Errorf("Failed to write response: %v", err) followed by return instead of
t.Fatalf.
- Around line 670-676: Capture the previous global tracer provider before
calling otel.SetTracerProvider(tp) in both
TestNewHyperFleetClient_HTTPInstrumentation and
TestNewHyperFleetClient_HTTPInstrumentation_ErrorCase (e.g., prevTp :=
otel.GetTracerProvider()), then in the defer restore it (e.g., defer
otel.SetTracerProvider(prevTp)) and still shutdown tp as before; ensure you
reference the local tp shutdown logic (tp.Shutdown(ctx)) but add restoring the
previous provider to avoid test-state leakage.
…OTel traceparent function and tests
Actionable comments posted: 5
♻️ Duplicate comments (3)
pkg/telemetry/otel.go (1)
93: ⚠️ Potential issue | 🟠 Major

Prefer parent-based ratio sampling to preserve incoming trace decisions.

Using `trace.TraceIDRatioBased` directly can resample child spans independently. Wrap it with `trace.ParentBased(...)` for distributed trace continuity.

💡 Proposed fix

```diff
-	sampler = trace.TraceIDRatioBased(samplingRate)
+	sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel.go` at line 93, Replace the direct TraceIDRatioBased sampler assignment so child spans respect remote parent sampling: instead of assigning sampler = trace.TraceIDRatioBased(samplingRate) wrap it with trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) (use the same samplingRate variable) so parent-based sampling is preserved for incoming trace contexts.

cmd/sentinel/main.go (2)
34: ⚠️ Potential issue | 🟡 Minor

Set the default OTel service name to `hyperfleet-sentinel`.

Keeping the default as `sentinel` can fragment service identity in tracing backends unless explicitly overridden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` at line 34, The default OTel serviceName value currently set as serviceName = "sentinel" should be changed to "hyperfleet-sentinel" to standardize tracing identity; update the assignment of the serviceName variable in cmd/sentinel/main.go (the serviceName symbol) to use "hyperfleet-sentinel" as the default so traces are attributed consistently unless explicitly overridden.
301-305: ⚠️ Potential issue | 🟡 Minor

Use the telemetry shutdown context when logging shutdown errors.

The error log still uses `ctx` instead of `otelShutdownCtx`, which may already be canceled during shutdown.

💡 Proposed fix

```diff
 if err := telemetry.Shutdown(otelShutdownCtx, tp); err != nil {
-	log.Extra("error", err).Error(ctx, "Failed to shutdown OpenTelemetry")
+	log.Extra("error", err).Error(otelShutdownCtx, "Failed to shutdown OpenTelemetry")
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 301 - 305, The shutdown logging currently uses the outer request context variable ctx which may be canceled; change the log call to use the telemetry shutdown context instead: when calling telemetry.Shutdown(otelShutdownCtx, tp) and it returns an error, pass otelShutdownCtx into log.Extra(...).Error(...) (keeping the existing log.Extra("error", err) and message) so the failure is logged with the shutdown-specific context created by context.WithTimeout (otelShutdownCtx/otelShutdownCancel).
🧹 Nitpick comments (1)
test/integration/integration_test.go (1)
902-909: Trace continuity assertion is too broad and can become flaky.

This currently counts trace IDs from all captured spans. If unrelated root spans are emitted, the check can fail even when poll/evaluate/publish hierarchy is correct. Prefer deriving continuity from poll traces (or poll descendants) only.

Suggested tightening

```diff
-	// Verify trace continuity - spans should form coherent traces
-	traceIDs := make(map[string]bool)
-	for _, span := range spans {
-		traceIDs[span.SpanContext.TraceID().String()] = true
-	}
-
-	if len(traceIDs) > len(pollSpans) {
-		t.Errorf("Expected spans to belong to %d traces (one per poll), but found %d trace IDs", len(pollSpans), len(traceIDs))
-	}
+	// Verify trace continuity for poll roots specifically (one trace per poll span)
+	pollTraceIDs := make(map[string]struct{})
+	for _, pollSpan := range pollSpans {
+		pollTraceIDs[pollSpan.SpanContext.TraceID().String()] = struct{}{}
+	}
+	if len(pollTraceIDs) != len(pollSpans) {
+		t.Errorf("Expected one unique trace per poll span, got %d unique traces for %d poll spans", len(pollTraceIDs), len(pollSpans))
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/integration_test.go` around lines 902 - 909, The current check counts trace IDs from all captured spans which can include unrelated roots and become flaky; change it to derive continuity only from the poll traces by first building a set of expectedTraceIDs from pollSpans (use each pollSpan.SpanContext.TraceID().String()) and then only collect/count trace IDs from spans whose TraceID is in that expectedTraceIDs set (i.e., filter spans by membership in the poll trace set) before asserting the number of distinct poll-related traces matches len(pollSpans).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/config/config.go`:
- Around line 110-112: The code currently assigns MESSAGING_SYSTEM even when the
environment variable is present but empty; update the block that reads
os.LookupEnv("MESSAGING_SYSTEM") so it trims and checks the value is non-empty
before assigning to cfg.MessagingSystem (e.g., use strings.TrimSpace on the
looked-up value and only set cfg.MessagingSystem when the trimmed value != "");
this avoids clearing defaults or emitting empty messaging.system attributes.
In `@internal/sentinel/sentinel.go`:
- Around line 114-118: The telemetry spans created via telemetry.StartSpan
(e.g., evalSpan and publishSpan in the sentinel.evaluate flow) are not ended on
error paths that call continue; update each error branch that returns/continues
after calling resource.SetData or publisher.Publish to call the corresponding
span.End() (evalSpan.End() and, if publishSpan was created, publishSpan.End())
before continuing so all spans are closed; search for usages of SetData and
Publish within the sentinel evaluation loop (the branches around evalSpan and
publishSpan creation) and add the missing End() calls on every error path.
In `@pkg/telemetry/otel_test.go`:
- Around line 84-104: The test calls Shutdown twice (once in the deferred
anonymous func and once explicitly), causing flakiness; update
TestInitTraceProvider_OTLPExporter to call Shutdown only once by removing the
redundant call: either eliminate the explicit err = Shutdown(shutdownCtx, tp)
block (keeping the defer) or remove the defer and keep the explicit shutdown,
ensuring only a single call to Shutdown(ctx, tp) on the trace provider
(reference Shutdown and the deferred anonymous func that wraps Shutdown).
In `@test/integration/integration_test.go`:
- Around line 717-718: The test currently ignores the error returned by
client.NewHyperFleetClient; modify the setup to capture the error (e.g.,
hyperfleetClient, err := client.NewHyperFleetClient(...)) and fail fast if it is
non-nil by calling t.Fatalf or using the test assertion helper (e.g.,
require.NoError) so setup failures are explicit; update the code around
NewHyperFleetClient and any test setup that depends on hyperfleetClient to use
the checked variable.
- Around line 675-682: The test sets a new global OTEL tracer provider with
otel.SetTracerProvider(tp) but never restores the previous global provider,
leaking state to other tests; capture the existing provider via
otel.GetTracerProvider() before calling otel.SetTracerProvider(tp) and in the
deferred cleanup restore it by calling otel.SetTracerProvider(prevProvider) (and
then perform tp.Shutdown(cleanupCtx) as currently implemented), referencing
otel.SetTracerProvider, otel.GetTracerProvider, tp, and tp.Shutdown to locate
where to add the save/restore logic.
---
Duplicate comments:
In `@cmd/sentinel/main.go`:
- Line 34: The default OTel serviceName value currently set as serviceName =
"sentinel" should be changed to "hyperfleet-sentinel" to standardize tracing
identity; update the assignment of the serviceName variable in
cmd/sentinel/main.go (the serviceName symbol) to use "hyperfleet-sentinel" as
the default so traces are attributed consistently unless explicitly overridden.
- Around line 301-305: The shutdown logging currently uses the outer request
context variable ctx which may be canceled; change the log call to use the
telemetry shutdown context instead: when calling
telemetry.Shutdown(otelShutdownCtx, tp) and it returns an error, pass
otelShutdownCtx into log.Extra(...).Error(...) (keeping the existing
log.Extra("error", err) and message) so the failure is logged with the
shutdown-specific context created by context.WithTimeout
(otelShutdownCtx/otelShutdownCancel).
In `@pkg/telemetry/otel.go`:
- Line 93: Replace the direct TraceIDRatioBased sampler assignment so child
spans respect remote parent sampling: instead of assigning sampler =
trace.TraceIDRatioBased(samplingRate) wrap it with
trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) (use the same
samplingRate variable) so parent-based sampling is preserved for incoming trace
contexts.
---
Nitpick comments:
In `@test/integration/integration_test.go`:
- Around line 902-909: The current check counts trace IDs from all captured
spans which can include unrelated roots and become flaky; change it to derive
continuity only from the poll traces by first building a set of expectedTraceIDs
from pollSpans (use each pollSpan.SpanContext.TraceID().String()) and then only
collect/count trace IDs from spans whose TraceID is in that expectedTraceIDs set
(i.e., filter spans by membership in the poll trace set) before asserting the
number of distinct poll-related traces matches len(pollSpans).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- cmd/sentinel/main.go
- configs/dev-example.yaml
- configs/gcp-pubsub-example.yaml
- configs/rabbitmq-example.yaml
- internal/client/client_test.go
- internal/config/config.go
- internal/sentinel/sentinel.go
- pkg/telemetry/otel.go
- pkg/telemetry/otel_test.go
- test/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/client/client_test.go
… tracing standardization
Actionable comments posted: 2
♻️ Duplicate comments (2)
cmd/sentinel/main.go (1)
301-305: ⚠️ Potential issue | 🟡 Minor

Use `otelShutdownCtx` for the shutdown error log context. Line 304 logs with `ctx` even though telemetry shutdown is executed under `otelShutdownCtx`.

💡 Proposed fix
```diff
 if tp != nil {
     otelShutdownCtx, otelShutdownCancel := context.WithTimeout(context.Background(), 5*time.Second)
     defer otelShutdownCancel()
     if err := telemetry.Shutdown(otelShutdownCtx, tp); err != nil {
-        log.Extra("error", err).Error(ctx, "Failed to shutdown OpenTelemetry")
+        log.Extra("error", err).Error(otelShutdownCtx, "Failed to shutdown OpenTelemetry")
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 301 - 305, The shutdown error log is using the wrong context; change the log call to use otelShutdownCtx (created by context.WithTimeout) instead of ctx so the telemetry.Shutdown error is logged with the same context it runs under: locate otelShutdownCtx and otelShutdownCancel and update the log.Extra(...).Error(...) invocation that follows telemetry.Shutdown(otelShutdownCtx, tp) to pass otelShutdownCtx as the context argument.

internal/sentinel/sentinel.go (1)
134-143: ⚠️ Potential issue | 🔴 Critical

End and mark `evalSpan` on `SetData` failure before `continue`. On Line 142, `continue` skips Line 192, so `evalSpan` leaks on this error path and the span is not marked failed.

🐛 Proposed fix
```diff
 if err := event.SetData(cloudevents.ApplicationJSON, map[string]interface{}{
     "kind":       resource.Kind,
     "id":         resource.ID,
     "generation": resource.Generation,
     "href":       resource.Href,
     "reason":     decision.Reason,
 }); err != nil {
+    evalSpan.RecordError(err)
+    evalSpan.SetStatus(codes.Error, "set event data failed")
     s.logger.Errorf(eventCtx, "Failed to set event data resource_id=%s error=%v", resource.ID, err)
+    evalSpan.End()
     continue
 }
```

Also applies to: 192-192
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel.go` around lines 134 - 143, When event.SetData(...) fails in the block around event.SetData, ensure you mark and end the `evalSpan` before returning/continuing: after logging the error in the SetData error branch, record the error on `evalSpan` (e.g., RecordError / SetStatus to error with the err message) and call `evalSpan.End()` prior to the `continue` so the span is not leaked; apply the same fix to the other identical error path near the later SetData call (the occurrence referenced at line ~192).
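The span-leak pattern flagged above can be reproduced with a toy span type. `fakeSpan` and `processAll` below are illustrative stand-ins, not the Sentinel code; the point is that every `continue` branch must end the span it opened:

```go
package main

import "fmt"

// fakeSpan is a hypothetical stand-in for an OTel span; it only tracks
// whether End was called, so a leak is observable.
type fakeSpan struct{ ended bool }

func (s *fakeSpan) End() { s.ended = true }

// processAll mirrors the loop shape from the review: error branches that
// continue must end the span first, or the span leaks.
func processAll(items []string) []*fakeSpan {
	var spans []*fakeSpan
	for _, item := range items {
		span := &fakeSpan{}
		spans = append(spans, span)
		if item == "bad" {
			// Error path: end the span *before* continue, or it leaks.
			span.End()
			continue
		}
		span.End() // happy path
	}
	return spans
}

func main() {
	for _, s := range processAll([]string{"ok", "bad", "ok"}) {
		fmt.Println(s.ended)
	}
}
```

In the real code a `defer span.End()` placed right after span creation would make all paths (including `continue`) safe automatically, at the cost of the span covering the whole iteration.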
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 151-155: The code currently ignores parse errors from
strconv.ParseBool for TRACING_ENABLED; modify the block around
otelEnabled/strconv.ParseBool so that when err != nil you either log a clear
error and exit (fail fast) or at minimum emit a warning to stderr/logger
indicating the invalid value and that tracing will remain at its default, e.g.
using log.Printf or fmt.Fprintf(os.Stderr, ...), referencing the TRACING_ENABLED
env var, the strconv.ParseBool call, and cfg.OTel.Enabled to show
current/default state; ensure successful ParseBool still sets cfg.OTel.Enabled
as before.
In `@pkg/telemetry/otel.go`:
- Around line 77-80: The code silently ignores strconv.ParseFloat errors when
reading ENV_OTEL_TRACES_SAMPLER_ARG; update the parsing branch (the block
referencing ENV_OTEL_TRACES_SAMPLER_ARG and samplingRate) to detect a parse
error and handle it explicitly by logging a clear warning or returning an error
that includes the environment variable name and the invalid value, and only
assign samplingRate on successful parse; ensure the log/returned message
references ENV_OTEL_TRACES_SAMPLER_ARG and the raw arg so misconfiguration is
visible.
---
Duplicate comments:
In `@cmd/sentinel/main.go`:
- Around line 301-305: The shutdown error log is using the wrong context; change
the log call to use otelShutdownCtx (created by context.WithTimeout) instead of
ctx so the telemetry.Shutdown error is logged with the same context it runs
under: locate otelShutdownCtx and otelShutdownCancel and update the
log.Extra(...).Error(...) invocation that follows
telemetry.Shutdown(otelShutdownCtx, tp) to pass otelShutdownCtx as the context
argument.
In `@internal/sentinel/sentinel.go`:
- Around line 134-143: When event.SetData(...) fails in the block around
event.SetData, ensure you mark and end the `evalSpan` before
returning/continuing: after logging the error in the SetData error branch,
record the error on `evalSpan` (e.g., RecordError / SetStatus to error with the
err message) and call `evalSpan.End()` prior to the `continue` so the span is
not leaked; apply the same fix to the other identical error path near the later
SetData call (the occurrence referenced at line ~192).
/retest
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/telemetry/otel.go (1)
87-96: `ParentBased` not applied for `AlwaysSample` and `NeverSample` cases.

When `samplingRate >= 1.0` or `<= 0.0`, the sampler is set directly to `AlwaysSample()` or `NeverSample()` without wrapping in `ParentBased()`. This means parent sampling decisions are ignored in these cases, potentially breaking trace continuity across service boundaries.

For consistency with the architecture standard that specifies `parentbased_traceidratio` as the default strategy, consider wrapping all samplers:

♻️ Proposed fix
```diff
 var sampler trace.Sampler
 switch {
 case samplingRate >= 1.0:
-    sampler = trace.AlwaysSample() // Sample all
+    sampler = trace.ParentBased(trace.AlwaysSample())
 case samplingRate <= 0.0:
-    sampler = trace.NeverSample() // Sample none
+    sampler = trace.ParentBased(trace.NeverSample())
 default:
     sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel.go` around lines 87 - 96, The sampler logic sets sampler directly to trace.AlwaysSample() or trace.NeverSample() for samplingRate >= 1.0 or <= 0.0, which bypasses parent-based behavior; change those branches to wrap the base samplers with trace.ParentBased(...) so all cases—including trace.AlwaysSample(), trace.NeverSample() and trace.TraceIDRatioBased(samplingRate)—are applied via trace.ParentBased, keeping the existing switch and the sampler variable but replacing direct calls with trace.ParentBased(trace.AlwaysSample()), trace.ParentBased(trace.NeverSample()), and trace.ParentBased(trace.TraceIDRatioBased(samplingRate)).

pkg/telemetry/otel_test.go (1)
181-185: Test assertion for `expectedSample=false` cases may be misleading.

For `always_off` (line 117-121), the test expects `expectedSample: false` but only checks `IsValid()` when `expectedSample` is true. The span context remains valid even when not sampled: `NeverSample` still produces a valid `SpanContext`, it just won't be exported.

To properly verify sampling behavior, check `span.SpanContext().IsSampled()` instead:

💡 Suggested enhancement
```diff
-    if tt.expectedSample {
-        if !span.SpanContext().IsValid() {
-            t.Error("Expected valid span context for sampling=true")
-        }
+    if span.SpanContext().IsSampled() != tt.expectedSample {
+        t.Errorf("Expected IsSampled()=%v, got %v", tt.expectedSample, span.SpanContext().IsSampled())
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 181 - 185, The test currently only checks span.SpanContext().IsValid() when tt.expectedSample is true, which doesn't verify sampling; update the assertion to use span.SpanContext().IsSampled() (and assert true/false according to tt.expectedSample) so cases like the "always_off" sampler correctly validate sampling behavior; locate the table-driven test and replace the IsValid-based conditional with an assertion on SpanContext().IsSampled() for the test case keyed by tt.expectedSample in the test function in pkg/telemetry/otel_test.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 449-454: The deferred shutdown of the tracer provider uses
t.Fatalf which runs after the test body and can mislead test results; change the
deferred call in the shutdown closure (the anonymous func that calls
tp.Shutdown(ctx) on the trace.TracerProvider instance `tp` using `ctx`) to use
t.Errorf to report errors instead of t.Fatalf, so cleanup failures are reported
without aborting after test completion.
- Line 467: The test currently ignores the error returned by
NewHyperFleetClient; change the setup to capture the error (e.g.,
hyperfleetClient, err := client.NewHyperFleetClient(server.URL, 10*time.Second))
and fail fast on error (e.g., t.Fatalf or require.NoError(t, err)) so setup
failures are reported immediately instead of surfacing as secondary errors later
when hyperfleetClient is used.
In `@pkg/telemetry/otel.go`:
- Around line 77-83: The current logic checks ENV_OTEL_TRACES_SAMPLER_ARG and
logs "Invalid" in the else branch when the env var is empty, but it fails to
warn on strconv.ParseFloat errors; update the logic in the block that reads
ENV_OTEL_TRACES_SAMPLER_ARG so that: if arg == "" do nothing (keep default
samplingRate), if arg != "" attempt strconv.ParseFloat and on err != nil call
log.Warnf with a clear message including ENV_OTEL_TRACES_SAMPLER_ARG and the bad
arg value and that the default samplingRate will be used, and only set
samplingRate when ParseFloat succeeds; adjust references to
ENV_OTEL_TRACES_SAMPLER_ARG, arg, strconv.ParseFloat, samplingRate, and
log.Warnf accordingly.
---
Nitpick comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 181-185: The test currently only checks
span.SpanContext().IsValid() when tt.expectedSample is true, which doesn't
verify sampling; update the assertion to use span.SpanContext().IsSampled() (and
assert true/false according to tt.expectedSample) so cases like the "always_off"
sampler correctly validate sampling behavior; locate the table-driven test and
replace the IsValid-based conditional with an assertion on
SpanContext().IsSampled() for the test case keyed by tt.expectedSample in the
test function in pkg/telemetry/otel_test.go.
In `@pkg/telemetry/otel.go`:
- Around line 87-96: The sampler logic sets sampler directly to
trace.AlwaysSample() or trace.NeverSample() for samplingRate >= 1.0 or <= 0.0,
which bypasses parent-based behavior; change those branches to wrap the base
samplers with trace.ParentBased(...) so all cases—including
trace.AlwaysSample(), trace.NeverSample() and
trace.TraceIDRatioBased(samplingRate)—are applied via trace.ParentBased, keeping
the existing switch and the sampler variable but replacing direct calls with
trace.ParentBased(trace.AlwaysSample()), trace.ParentBased(trace.NeverSample()),
and trace.ParentBased(trace.TraceIDRatioBased(samplingRate)).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- cmd/sentinel/main.go
- internal/client/client_test.go
- internal/config/config.go
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
- pkg/telemetry/otel.go
- pkg/telemetry/otel_test.go
- test/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/client/client_test.go
- internal/config/config.go
…r check for new hyperfleet client
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 142-149: Add nil/zero guards in SetTraceContext: return early if
event is nil or span is nil/invalid before calling span.SpanContext(); check
span != nil (or equivalent) and use span.SpanContext().IsValid() only after
those checks to avoid panics from dereferencing nil spans or nil events, and
keep the existing traceparent formatting logic (traceParent := fmt.Sprintf(...)
and event.SetExtension("traceparent", traceParent)) inside the guarded block so
they only run when both inputs are valid.
- Around line 41-43: The current code calls
otlptracehttp.WithEndpointURL(otlpEndpoint) (inside otlptracehttp.New), which
treats the value as a full URL and prevents the SDK from appending the
/v1/traces path; remove that call and either let the SDK read
OTEL_EXPORTER_OTLP_ENDPOINT itself (omit the WithEndpointURL option) or, if
specifying programmatically, replace WithEndpointURL with
otlptracehttp.WithEndpoint(...) using a host:port value (and keep the call site
in otlptracehttp.New).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
- pkg/telemetry/otel.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/sentinel/sentinel_test.go
- internal/sentinel/sentinel.go
… check in SetTraceContext()
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 41-42: The exporter selection currently only checks
ENV_OTEL_EXPORTER_OTLP_ENDPOINT; update the gating logic to also consider
ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and prefer the traces-specific variable
when present so otlptracehttp.New is attempted for traces; modify the check
around otlptracehttp.New (references: ENV_OTEL_EXPORTER_OTLP_ENDPOINT,
ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, otlptracehttp.New, exporter, err) to
look for the traces endpoint first and fall back to the generic OTLP endpoint
before defaulting to stdout.
test/integration/integration_test.go
Outdated
```go
// Mock server that returns clusters requiring reconciliation
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	callCount++
```
Priority: Bug
callCount at line 726 is a plain int incremented from the HTTP handler
goroutine (line 731) without synchronization — httptest.NewServer runs each
handler in its own goroutine, so this is a data race. Running with -race would
flag it.
You already use atomic.Int32 for the same pattern in otel_test.go:72, so just
apply it here too:
```diff
-    callCount++
+    var callCount atomic.Int32
```
Ack, added similar implementation as the otel_test
test/integration/integration_test.go
Outdated
```go
// Mock server that returns clusters requiring reconciliation
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	callCount++
	if callCount > 2 {
```
```diff
-    if callCount > 2 {
+    if callCount.Add(1) > 2 {
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/sentinel/sentinel.go (1)
205-207: ⚠️ Potential issue | 🟡 Minor

Use `evalCtx` for skipped-resource logs too. Line 206 rebuilds `skipCtx` from `ctx`, so skipped-resource logs lose the `span_id` that `telemetry.StartSpan` attached to `evalCtx`. Build `skipCtx` from `evalCtx` here, the same way the publish path uses `eventCtx`, so the `sentinel.evaluate` span stays correlated with its logs.

♻️ Proposed fix
```diff
-    skipCtx := logger.WithDecisionReason(ctx, decision.Reason)
+    skipCtx := logger.WithDecisionReason(evalCtx, decision.Reason)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel.go` around lines 205 - 207, The skipped-resource log is built from ctx which drops the telemetry span; change the context used when creating skipCtx to evalCtx so the span started by telemetry.StartSpan remains attached—i.e., where you call logger.WithDecisionReason(ctx, decision.Reason) create skipCtx from evalCtx instead (same pattern used for eventCtx in the publish path) so sentinel.evaluate span stays correlated with skipped-resource logs.
♻️ Duplicate comments (3)
pkg/telemetry/otel_test.go (2)
325-334: ⚠️ Potential issue | 🟡 Minor

Guard the `traceparent` assertions before slicing. If the length check fails, `traceParentStr[:3]` and `traceParentStr[len(traceParentStr)-3:]` can still panic and hide the real assertion failure. Use `strings.HasPrefix`/`HasSuffix` or return immediately after the length mismatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 325 - 334, The test currently slices traceParentStr after an unchecked length assertion which can panic; change the checks so the length guard prevents further slicing—either call t.Fatalf or return immediately if len(traceParentStr) != 55, or replace the prefix/suffix slices with strings.HasPrefix(traceParentStr, "00-") and strings.HasSuffix(traceParentStr, "-01") to avoid slicing altogether; update the assertions that reference traceParentStr[:3] and traceParentStr[len(traceParentStr)-3:] accordingly.
21-38: ⚠️ Potential issue | 🟠 Major

Restore the previous global OTel state in cleanup. `InitTraceProvider` sets the process-wide tracer provider and propagator, but this cleanup only shuts `tp` down. That leaves later tests pointing at mutated global state, which makes failures order-dependent. Capture the previous provider/propagator before initialization and restore both in cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 21 - 38, TestInitTraceProvider_StdoutExporter leaves global OTel state mutated; before calling InitTraceProvider, capture the existing global tracer provider and text map propagator (e.g. via otel.GetTracerProvider() and otel.GetTextMapPropagator()), then after initialization ensure the deferred cleanup not only calls Shutdown(ctx, tp) but also restores the captured global provider and propagator (set them back with otel.SetTracerProvider(...) and otel.SetTextMapPropagator(...)) so later tests aren't affected.

internal/sentinel/sentinel_test.go (1)
646-656: ⚠️ Potential issue | 🟠 Major

Restore the previous global tracer provider after the test. `otel.SetTracerProvider(tp)` mutates process-wide state, but the defer only shuts `tp` down. Later tests can inherit a closed provider and become order-dependent. Capture `otel.GetTracerProvider()` before the set and restore it in the cleanup closure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel_test.go` around lines 646 - 656, Capture the existing global tracer provider before calling otel.SetTracerProvider(tp) (e.g. oldTP := otel.GetTracerProvider()), then update the deferred cleanup to first restore the global provider with otel.SetTracerProvider(oldTP) and then shut down tp using the existing shutdown logic (use the same ctx), so the test does not leave a closed provider as the global one; reference otel.SetTracerProvider, otel.GetTracerProvider, tp and the deferred shutdown closure when making the change.
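The save/restore pattern these prompts describe reduces to a small sketch. Here `globalProvider` is a hypothetical stand-in for the process-wide provider set via `otel.SetTracerProvider`:

```go
package main

import "fmt"

// globalProvider stands in for the process-wide tracer provider.
var globalProvider = "default"

// withTestProvider mirrors the pattern the review asks for: capture the
// previous global, install the test one, and restore it in cleanup so no
// state leaks into other tests.
func withTestProvider(name string, body func()) {
	prev := globalProvider
	globalProvider = name
	defer func() { globalProvider = prev }()
	body()
}

func main() {
	withTestProvider("in-memory", func() {
		fmt.Println(globalProvider) // in-memory
	})
	fmt.Println(globalProvider) // default
}
```

In Go tests the same shape is usually written with `t.Cleanup(func() { otel.SetTracerProvider(prev) })` so restoration runs even if the test fails early.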
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 81-105: The sampler construction must select the OTEL sampler
based on samplerType instead of always wrapping in trace.ParentBased; update the
logic that currently uses samplingRate and trace.ParentBased(...) to a switch on
samplerType: use trace.AlwaysSample() for "always_on", trace.NeverSample() for
"always_off", trace.TraceIDRatioBased(samplingRate) for "traceidratio" (ensuring
samplingRate is parsed and in [0,1]), and
trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) for
"parentbased_traceidratio" (also validating samplingRate); keep the existing env
parsing for samplingRate but apply it only to the types that need a rate, and
replace occurrences of trace.ParentBased(trace.AlwaysSample()) /
trace.ParentBased(trace.NeverSample()) /
trace.ParentBased(trace.TraceIDRatioBased(...)) with the correct direct samplers
referenced by samplerType, samplerType, samplingRate, trace.AlwaysSample,
trace.NeverSample, and trace.TraceIDRatioBased.
---
Outside diff comments:
In `@internal/sentinel/sentinel.go`:
- Around line 205-207: The skipped-resource log is built from ctx which drops
the telemetry span; change the context used when creating skipCtx to evalCtx so
the span started by telemetry.StartSpan remains attached—i.e., where you call
logger.WithDecisionReason(ctx, decision.Reason) create skipCtx from evalCtx
instead (same pattern used for eventCtx in the publish path) so
sentinel.evaluate span stays correlated with skipped-resource logs.
---
Duplicate comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 646-656: Capture the existing global tracer provider before
calling otel.SetTracerProvider(tp) (e.g. oldTP := otel.GetTracerProvider()),
then update the deferred cleanup to first restore the global provider with
otel.SetTracerProvider(oldTP) and then shut down tp using the existing shutdown
logic (use the same ctx), so the test does not leave a closed provider as the
global one; reference otel.SetTracerProvider, otel.GetTracerProvider, tp and the
deferred shutdown closure when making the change.
In `@pkg/telemetry/otel_test.go`:
- Around line 325-334: The test currently slices traceParentStr after an
unchecked length assertion which can panic; change the checks so the length
guard prevents further slicing—either call t.Fatalf or return immediately if
len(traceParentStr) != 55, or replace the prefix/suffix slices with
strings.HasPrefix(traceParentStr, "00-") and strings.HasSuffix(traceParentStr,
"-01") to avoid slicing altogether; update the assertions that reference
traceParentStr[:3] and traceParentStr[len(traceParentStr)-3:] accordingly.
- Around line 21-38: TestInitTraceProvider_StdoutExporter leaves global OTel
state mutated; before calling InitTraceProvider, capture the existing global
tracer provider and text map propagator (e.g. via otel.GetTracerProvider() and
otel.GetTextMapPropagator()), then after initialization ensure the deferred
cleanup not only calls Shutdown(ctx, tp) but also restores the captured global
provider and propagator (set them back with otel.SetTracerProvider(...) and
otel.SetTextMapPropagator(...)) so later tests aren’t affected.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6f58494c-e397-4fff-bb86-a3b3ccc4244f
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (14)
- cmd/sentinel/main.go
- configs/dev-example.yaml
- configs/gcp-pubsub-example.yaml
- configs/rabbitmq-example.yaml
- go.mod
- internal/client/client.go
- internal/client/client_test.go
- internal/config/config.go
- internal/health/health.go
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
- pkg/telemetry/otel.go
- pkg/telemetry/otel_test.go
- test/integration/integration_test.go
✅ Files skipped from review due to trivial changes (1)
- internal/health/health.go
🚧 Files skipped from review as they are similar to previous changes (6)
- internal/client/client.go
- configs/gcp-pubsub-example.yaml
- configs/dev-example.yaml
- test/integration/integration_test.go
- configs/rabbitmq-example.yaml
- internal/config/config.go
```go
var samplingRate = defaultSamplingRate

log := logger.NewHyperFleetLogger()
```
The tracing standard specifies OTLP gRPC as the default export protocol
(OTEL_EXPORTER_OTLP_PROTOCOL defaults to grpc, code samples use otlptracegrpc,
Kubernetes examples use port 4317). This PR uses otlptracehttp instead (line
42), which defaults to port 4318.
If the Collector is configured per the standard
(otel-collector.observability.svc:4317), trace export will fail silently
because the HTTP client will try to talk HTTP to a gRPC port.
The JIRA ticket says to follow the hyperfleet-api reference implementation —
worth checking which exporter it uses so the two components stay consistent.
If HTTP was chosen intentionally (e.g., for simpler TLS setup), it might be
worth documenting the deviation and also handling OTEL_EXPORTER_OTLP_PROTOCOL
to support both.
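The HTTP-vs-gRPC port split behind this comment can be made concrete. The helper below is hypothetical (not part of the OTel SDK), but the 4317/4318 defaults come from the OTLP specification:

```go
package main

import (
	"fmt"
	"strings"
)

// defaultOTLPPort maps OTEL_EXPORTER_OTLP_PROTOCOL values to the default
// collector ports from the OTLP spec: gRPC listens on 4317, HTTP on 4318.
// An HTTP exporter pointed at 4317 speaks the wrong protocol and exports
// fail, which is the silent-failure mode flagged in review.
func defaultOTLPPort(protocol string) int {
	switch strings.ToLower(protocol) {
	case "grpc", "":
		// The tracing standard discussed above defaults to gRPC.
		return 4317
	case "http/protobuf", "http/json":
		return 4318
	default:
		return -1 // unknown protocol
	}
}

func main() {
	fmt.Println(defaultOTLPPort("grpc"))          // 4317
	fmt.Println(defaultOTLPPort("http/protobuf")) // 4318
}
```

Honoring `OTEL_EXPORTER_OTLP_PROTOCOL` and choosing the exporter (and port) from it, as the comment suggests, would let a deployment switch protocols without code changes.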
Hyperfleet-api does not have any implementation of these tracing standards to reference. Used the suggested gRPC exporter here.
… TODO to set OTel enable to true in default config
Actionable comments posted: 4
♻️ Duplicate comments (2)
pkg/telemetry/otel_test.go (2)
296-305: ⚠️ Potential issue | 🟡 Minor

Avoid fixed-index slicing after a non-fatal length check. If `traceParentStr` is shorter than expected, the `t.Errorf` on Line 297 does not stop execution and the next slices can panic, hiding the real failure.

💡 Proposed fix
```diff
+import "strings"
@@
-    if traceParentStr[:3] != "00-" {
-        t.Errorf("Expected traceparent to start with '00-', got %s", traceParentStr[:3])
+    if !strings.HasPrefix(traceParentStr, "00-") {
+        t.Errorf("Expected traceparent to start with '00-', got %s", traceParentStr)
     }
-    if traceParentStr[len(traceParentStr)-3:] != "-01" {
-        t.Errorf("Expected traceparent to end with '-01', got %s", traceParentStr[len(traceParentStr)-3:])
+    if !strings.HasSuffix(traceParentStr, "-01") {
+        t.Errorf("Expected traceparent to end with '-01', got %s", traceParentStr)
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 296 - 305, The test currently uses t.Errorf for the traceParentStr length check then immediately slices traceParentStr which can panic if shorter; change the length assertion to a fatal assertion (use t.Fatalf) or return after reporting the error so execution stops, or alternatively perform guarded checks (e.g., verify len(traceParentStr) >= 55 before doing traceParentStr[:3] and traceParentStr[len-3:]) when validating traceParentStr in the otel_test.go test where traceParentStr is inspected for the "00-" prefix and "-01" suffix.
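A prefix/suffix-based check like the one proposed can be written so malformed input never panics. This is a stdlib-only sketch; `validTraceparent` is an illustrative helper, not code from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// validTraceparent checks the W3C traceparent shape the tests assert:
// version-traceid-spanid-flags, e.g. "00-<32 hex>-<16 hex>-01" (55 chars).
// Unlike fixed-index slicing, prefix/suffix and length checks cannot
// panic on short input.
func validTraceparent(tp string) bool {
	parts := strings.Split(tp, "-")
	if len(parts) != 4 {
		return false
	}
	return len(tp) == 55 &&
		strings.HasPrefix(tp, "00-") &&
		len(parts[1]) == 32 &&
		len(parts[2]) == 16 &&
		len(parts[3]) == 2
}

func main() {
	ok := "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
	fmt.Println(validTraceparent(ok))   // true
	fmt.Println(validTraceparent("00")) // false: too short, but no panic
}
```

(A full validator would also check that each segment is lowercase hex and that the trace/span IDs are nonzero; the sketch only covers the structural checks the test makes.)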
43-77: ⚠️ Potential issue | 🟠 Major: Use a real local OTLP endpoint in this test.
This regresses to the same fake-collector setup that makes `Shutdown` environment-dependent. With a fake host, the exporter can block or fail while flushing, so the test is no longer hermetic.
💡 Proposed fix
+ server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+     w.WriteHeader(http.StatusOK)
+ }))
+ defer server.Close()
+
- err := os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317")
+ err := os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", server.URL)
  if err != nil {
      t.Fatalf("Failed to set OTEL_EXPORTER_OTLP_ENDPOINT: %v", err)
  }
@@
- err = Shutdown(ctx, tp)
+ shutdownCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
+ defer cancel()
+ err = Shutdown(shutdownCtx, tp)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 43 - 77, The test sets OTEL_EXPORTER_OTLP_ENDPOINT to a fake host which can make Shutdown hang; update TestInitTraceProvider_OTLPExporter to use a real local OTLP endpoint (e.g. "http://localhost:4317") instead of "http://fake-otel-collector:4317" and add a quick connectivity pre-check (dial or HTTP GET) that skips the test if the local collector is not reachable so the test remains hermetic; modify references in the test that call InitTraceProvider and Shutdown to rely on the new endpoint and skip early when the pre-check fails.
🧹 Nitpick comments (2)
pkg/telemetry/otel_test.go (1)
180-186: Restore the previous global tracer provider in these tests. Both tests replace the package-global provider with `otel.SetTracerProvider(tp)` and only shut `tp` down. Later tests then inherit this in-memory provider and its sampling/export settings.
💡 Proposed fix
+ previousProvider := otel.GetTracerProvider()
  otel.SetTracerProvider(tp)
  defer func(ctx context.Context, tp *trace.TracerProvider) {
      err := Shutdown(ctx, tp)
      if err != nil {
          t.Errorf("Failed to shutdown trace provider : %v", err)
      }
+     otel.SetTracerProvider(previousProvider)
  }(ctx, tp)

Also applies to: 261-267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 180 - 186, The tests replace the global tracer provider with otel.SetTracerProvider(tp) but never restore the previous provider; capture the prior provider before calling otel.SetTracerProvider(tp) (e.g., prev := otel.GetTracerProvider()), and in the defer that currently calls Shutdown(ctx, tp) add a step to restore the global provider by calling otel.SetTracerProvider(prev) after shutting down tp; ensure the same change is applied to both occurrences (around otel.SetTracerProvider(tp) / Shutdown(ctx, tp) blocks, including the second instance at lines noted).

go.mod (1)
17-21: Align all OTel Go modules to the same release version. The current pins mix `v1.42.0` for `otel`/`sdk`/`trace` with `v1.40.0` for the exporters. OpenTelemetry Go's versioning standards require these modules to be kept synchronized; they're released together as a unit and should not be intentionally pinned to different versions. This prevents incompatibilities and makes future upgrades clearer.
♻️ Suggested alignment
- go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.40.0
- go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.40.0
+ go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.42.0
+ go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.42.0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go.mod` around lines 17 - 21, The go.mod pins mix OpenTelemetry module versions (go.opentelemetry.io/otel, go.opentelemetry.io/otel/sdk, go.opentelemetry.io/otel/trace at v1.42.0 while go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp and go.opentelemetry.io/otel/exporters/stdout/stdouttrace are at v1.40.0); update the exporter module versions to match the others (e.g., change otlptrace/otlptracehttp and stdouttrace to v1.42.0) so all otel modules are aligned, then run module tidy/get (go mod tidy or go get) to ensure dependency graph is consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/config/config.go`:
- Around line 117-122: The YAML-loaded file-backed value for messaging_system
isn't being normalized, so a blank or whitespace value can overwrite the
default; ensure the code trims whitespace and ignores empty values before
assigning to cfg.MessagingSystem (same logic used for the MESSAGING_SYSTEM env
branch). After unmarshalling (or wherever file values are applied), call
strings.TrimSpace on the file-provided messagingSystem and only set
cfg.MessagingSystem if the trimmed value != "".
In `@internal/sentinel/sentinel_test.go`:
- Around line 650-661: There are two defer blocks both calling tp.Shutdown(ctx);
remove the duplicate and consolidate into a single defer that restores the
previous provider and calls tp.Shutdown(ctx) exactly once with proper error
handling. Specifically, keep the previousProvider capture (from
otel.GetTracerProvider()), call otel.SetTracerProvider(tp) as before, then
replace the two defers with one defer func() that calls tp.Shutdown(ctx), checks
the returned error and reports it via t.Errorf if non-nil, and finally calls
otel.SetTracerProvider(previousProvider); ensure you reference tp, ctx,
previousProvider, otel.SetTracerProvider and tp.Shutdown in the combined defer
so Shutdown is only invoked once.
In `@internal/sentinel/sentinel.go`:
- Around line 139-145: The skip-branch currently builds skipCtx from the parent
ctx so skipped-resource logs are correlated to sentinel.poll; instead use the
per-resource evaluation context (evalCtx) created by
telemetry.StartSpan("sentinel.evaluate") so skip logs are children of that
span—update the logic that constructs skipCtx / any logging calls in the skip
branch to use evalCtx (and keep setting attributes on evalSpan as before),
ensuring you still cancel/finish the span appropriately after logging; locate
this change around the decisionEngine.Evaluate call, the evalCtx/evalSpan
variables, and the code that previously created skipCtx.
In `@pkg/logger/logger.go`:
- Around line 46-49: LogConfig.OTel.Enabled is declared but unused; either wire
it into the trace/span enrichment or remove it. Update buildEntry (the function
that currently adds trace_id/span_id from context.Context) to check
cfg.OTel.Enabled (OTelConfig.Enabled) before extracting/injecting
trace_id/span_id so toggling the flag disables those fields, or remove
OTelConfig.Enabled from LogConfig and any related code paths (and tests) if you
prefer to keep behavior unchanged. Ensure references to OTelConfig, LogConfig,
and buildEntry are updated consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1725aa91-c23e-4ef8-afc0-215bd3ed0f3a
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
- go.mod
- internal/config/config.go
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
- pkg/logger/logger.go
- pkg/telemetry/otel.go
- pkg/telemetry/otel_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/telemetry/otel.go
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/sentinel/sentinel.go (1)
185-202: Use `publishCtx` for publish-path logs. Once the child publish span is started, these log lines should use `publishCtx`; otherwise they keep the parent `sentinel.evaluate` span ID and lose direct log-to-span correlation for the publish operation.
♻️ Proposed fix
- s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
+ s.logger.Errorf(publishCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
@@
- s.logger.Infof(eventCtx, "Published event resource_id=%s ready=%t",
+ s.logger.Infof(publishCtx, "Published event resource_id=%s ready=%t",
      resource.ID, resource.Status.Ready)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel.go` around lines 185 - 202, The publish-path logs are using eventCtx (parent evaluate span) instead of the child publishCtx, breaking log-to-span correlation; update the two logging calls inside the publish block—specifically the s.logger.Errorf call in the Publish error branch and the s.logger.Infof success call after Publish—to pass publishCtx (not eventCtx), keeping publishSpan and evalSpan End() behavior unchanged so the publish logs correlate to the publishSpan.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 707-748: The test only checks span names/counts; update it to
assert trace hierarchy and the HTTP client span: use exporter.GetSpans() to
ensure all returned spans share the same TraceID, find the span with Name
"sentinel.poll" and assert that spans "sentinel.evaluate" and
"hyperfleet-clusters publish" have ParentSpanID equal to
sentinel.poll.SpanContext.SpanID, and assert there is an otelhttp client span
named "HTTP GET" present and tied to the same TraceID (and, if applicable, that
it is a child of "sentinel.evaluate"); also keep the existing published
CloudEvent traceparent check against mockPublisher.publishedEvents to ensure
propagation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4e41bf4e-8399-4df4-90a5-8e88e44a786e
📒 Files selected for processing (3)
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
- pkg/logger/logger.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/logger/logger.go
… add support for grpc and http both
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/sentinel/sentinel.go (1)
170-202: ⚠️ Potential issue | 🟡 Minor: Use `publishCtx` for the publish-path logs.
`telemetry.StartSpan` already returns a context enriched with the child publish span IDs. Logging with `eventCtx` keeps these messages attached to `sentinel.evaluate`, so the new publish span does not show up in structured logs.
🛠️ Suggested fix
  if err := s.publisher.Publish(publishCtx, topic, &event); err != nil {
      publishSpan.RecordError(err)
      publishSpan.SetStatus(codes.Error, "publish failed")

      // Record broker error
      metrics.UpdateBrokerErrorsMetric(resourceType, resourceSelector, "publish_error")
-     s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
+     s.logger.Errorf(publishCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
      publishSpan.End()
      evalSpan.End()
      continue
  }

- publishSpan.End()
-
  // Record successful event publication
  metrics.UpdateEventsPublishedMetric(resourceType, resourceSelector, decision.Reason)
- s.logger.Infof(eventCtx, "Published event resource_id=%s ready=%t",
+ s.logger.Infof(publishCtx, "Published event resource_id=%s ready=%t",
      resource.ID, resource.Status.Ready)
+ publishSpan.End()

  published++

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel.go` around lines 170 - 202, The publish-path logs are using eventCtx instead of the child context returned by telemetry.StartSpan (publishCtx), so the logs remain associated with the parent span; change the logging calls around s.publisher.Publish to use publishCtx (not eventCtx) — specifically replace s.logger.Errorf(eventCtx, ...) and s.logger.Infof(eventCtx, ...) with s.logger.Errorf(publishCtx, ...) and s.logger.Infof(publishCtx, ...) respectively, keeping publishSpan.RecordError/SetStatus, publishSpan.End(), evalSpan.End(), and metrics.Update* calls unchanged so the publish span trace context is attached to those log entries.
♻️ Duplicate comments (1)
pkg/telemetry/otel.go (1)
77-102: ⚠️ Potential issue | 🟠 Major: `parentbased_traceidratio` still ignores the configured rate, and the other branches dropped `ParentBased(...)`.
`OTEL_TRACES_SAMPLER_ARG` is only parsed inside the `traceidratio` branch, so the default/`parentbased_traceidratio` path always stays at `1.0`. At the same time, `always_on`, `always_off`, and `traceidratio` now return standalone samplers, which breaks the parent-based behavior Sentinel relies on for trace continuity.
🛠️ Suggested fix
- var sampler trace.Sampler
- samplerType := strings.ToLower(os.Getenv(envOtelTracesSampler))
+ if arg := os.Getenv(envOtelTracesSamplerArg); arg != "" {
+     if parsedRate, err := strconv.ParseFloat(arg, 64); err == nil && parsedRate >= 0.0 && parsedRate <= 1.0 {
+         samplingRate = parsedRate
+     } else {
+         log.Warnf(ctx, "Invalid %s value=%q, using default: %v", envOtelTracesSamplerArg, arg, samplingRate)
+     }
+ }
+
+ var sampler trace.Sampler
+ samplerType := strings.ToLower(os.Getenv(envOtelTracesSampler))
  switch samplerType {
  case samplerAlwaysOn:
-     sampler = trace.AlwaysSample()
+     sampler = trace.ParentBased(trace.AlwaysSample())
  case samplerAlwaysOff:
-     sampler = trace.NeverSample()
+     sampler = trace.ParentBased(trace.NeverSample())
  case samplerTraceIdRatio:
-     // Parse the sampling rate for traceidratio
-     rate := defaultSamplingRate // default 1.0
-     if arg := os.Getenv(envOtelTracesSamplerArg); arg != "" {
-         if parsedRate, err := strconv.ParseFloat(arg, 64); err == nil && parsedRate >= 0.0 && parsedRate <= 1.0 {
-             rate = parsedRate
-         } else {
-             log.Warnf(ctx, "Invalid %s value=%q, using default: %v", envOtelTracesSamplerArg, arg, rate)
-         }
-     }
-     sampler = trace.TraceIDRatioBased(rate)
+     sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))
  case parentBasedTraceIdRatio, "": // Default per tracing standard
      sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))

Based on learnings, HyperFleet tracing standards require Sentinel sampler branches to remain wrapped in `trace.ParentBased(...)`.
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel.go` around lines 77 - 102, The sampler selection currently parses OTEL_TRACES_SAMPLER_ARG only inside the traceidratio case and returns standalone samplers, which ignores configured rates for parentbased_traceidratio and breaks ParentBased behavior; move parsing of envOtelTracesSamplerArg (into a rate variable, defaulting to samplingRate or defaultSamplingRate) before the switch so all branches can use the same rate, and ensure every non-error branch wraps its choice with trace.ParentBased(...): use trace.ParentBased(trace.TraceIDRatioBased(rate)) for traceidratio and parentBasedTraceIdRatio/"" branches, and wrap trace.AlwaysSample()/trace.NeverSample() as trace.ParentBased(trace.AlwaysSample()) / trace.ParentBased(trace.NeverSample()) so ParentBased continuity (trace.ParentBased, trace.TraceIDRatioBased, trace.AlwaysSample, trace.NeverSample, samplerType, envOtelTracesSamplerArg, samplingRate) is preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 20c6dc44-18b4-482e-9ac7-ee70a0597c51
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
- go.mod
- internal/sentinel/sentinel.go
- pkg/logger/logger_test.go
- pkg/telemetry/otel.go
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 58-61: Update the error log in the OTLP exporter creation so it
doesn't hardcode "gRPC exporter": in the function that creates the exporter (the
block where the exporter is constructed and where log.Errorf(ctx, "Failed to
create OTLP gRPC exporter: %v", err) is called), change the message to include
the actual exporter kind (e.g., use an exporterKind variable set to "gRPC" or
"HTTP") or use a generic message such as "Failed to create OTLP exporter" and
append the error; ensure you reference the same err variable and replace the
hardcoded string in the log.Errorf call accordingly.
- Around line 107-113: The parentBasedTraceIdRatio and default branches are
using an unparsed samplingRate (still defaultSamplingRate) and thus ignore
OTEL_TRACES_SAMPLER_ARG; update the logic that handles samplerType to parse
OTEL_TRACES_SAMPLER_ARG into samplingRate (e.g., parse/validate the value into
the samplingRate variable, handling errors and clamping to [0,1]) before
constructing the sampler, and then use that parsed samplingRate in
trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) for the
parentBasedTraceIdRatio case and the default case (reference symbols:
samplerType, parentBasedTraceIdRatio, samplingRate, trace.ParentBased,
trace.TraceIDRatioBased); also emit a warning via log.Warnf if the arg is
invalid and fall back to defaultSamplingRate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: aa2d26ec-cd5a-4874-b9e8-b7d4bec66ce9
📒 Files selected for processing (1)
pkg/telemetry/otel.go
    exporter, err = otlptracegrpc.New(ctx)
}
if err != nil {
    log.Errorf(ctx, "Failed to create OTLP gRPC exporter: %v", err)
Priority: Bug
The error message on line 58 always says "Failed to create OTLP gRPC exporter" even when the
HTTP exporter was the one that failed. The if err != nil block handles errors from all
branches of the protocol switch above, but the message is hardcoded to say "gRPC".
Suggested fix — either move the error check inside each case, or use a generic message:
- log.Errorf(ctx, "Failed to create OTLP gRPC exporter: %v", err)
+ log.Errorf(ctx, "Failed to create OTLP exporter (protocol=%s): %v", protocol, err)
github.com/testcontainers/testcontainers-go/modules/rabbitmq v0.40.0
go.opentelemetry.io/otel v1.42.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.42.0
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.40.0
Priority: Pattern
go.opentelemetry.io/otel/exporters/stdout/stdouttrace is at v1.40.0 while the rest of the OTel
modules are at v1.42.0 (go.mod line 19). The OTel Go project recommends keeping all modules
in lockstep to avoid subtle behavior mismatches.
go get go.opentelemetry.io/otel/exporters/stdout/stdouttrace@v1.42.0
go mod tidy

registry := prometheus.NewRegistry()
metrics.NewSentinelMetrics(registry, "test")

cfg := &config.SentinelConfig{
Priority: Pattern
TestTrigger_CreatesRequiredSpans (line 683) creates a SentinelConfig without MessagingSystem,
which means the publish span will have messaging.system="". Since this test validates span
attributes, it should set the field explicitly (or use NewSentinelConfig()) so the test also
validates that messaging.system is correctly propagated.
cfg := &config.SentinelConfig{
ResourceType: "clusters",
Topic: "hyperfleet-clusters",
MaxAgeNotReady: 10 * time.Second,
MaxAgeReady: 30 * time.Minute,
MessagingSystem: "rabbitmq",
}
Summary
Implements comprehensive OpenTelemetry tracing instrumentation in Sentinel to enable distributed tracing across the HyperFleet orchestration pipeline.
Span hierarchy:
sentinel.poll (root)
└── sentinel.evaluate (one per resource)
    └── per-publish span (e.g. "hyperfleet-clusters publish")
Resolves HYPERFLEET-542
Changed
Tested