
feat(client): implement retry logic for fetching chain state, block t…#571

Draft
akrem-chabchoub wants to merge 1 commit into master from rpc-connection-retry



@akrem-chabchoub akrem-chabchoub commented Feb 27, 2026

Retry with backoff in GetOrCreateMutableBatch

Summary

Adds retry with exponential backoff for the read-only RPC/API calls inside GetOrCreateMutableBatch. Transient failures (e.g. the Bee API or Geth RPC being briefly unavailable) no longer fail a check on the first attempt, which avoids false alerts.

Problem

Checks that use postage (pt-pss, retrieval, pushsync, fileretrieval, etc.) call GetOrCreateMutableBatch. That function talks to:

  • Bee API (GetChainState, PostageBatches) – these calls may fail if the node's blockchain RPC is slow or down
  • Geth RPC (FetchBlockTime) – the RPC endpoint may hiccup

A single transient failure on any of these causes the whole check to fail and alert, even though the issue is outside our code (RPC/chain temporarily unavailable).

What changed

  • Retry helper retryDo in pkg/bee/client.go: exponential backoff (1s, 2s, 4s, 8s, 16s, capped at 30s), uses existing retryCount (default 5), respects context cancellation.
  • Wrapped only the 3 read-only steps in GetOrCreateMutableBatch:
    • GetChainState
    • FetchBlockTime
    • PostageBatches
  • No retry on CreatePostageBatch (write) to avoid creating duplicate batches.
  • Retries log a warning with operation name, attempt number and backoff so logs show when a transient failure was retried and recovered.

Files touched

File | Change
pkg/bee/client.go | GetOrCreateMutableBatch uses retryDo for the three read calls; added retryDo and maxRetryBackoff constant

Observed impact

Addresses errors like:

error="node bee-4-3: batched id get chain state: response message \"block number unavailable\": status: 500 Internal Server Error"

When the RPC is briefly unavailable, the call is retried with backoff instead of failing the check on the first attempt.

@akrem-chabchoub akrem-chabchoub added this to the 2026 milestone Mar 5, 2026
@akrem-chabchoub akrem-chabchoub self-assigned this Mar 5, 2026