Conversation
…, and attestation propagation
The aggregated attestation pipeline was broken at multiple points, preventing finalization in multi-node setups:
- Add missing GossipAggregatedAttestationEvent to network events
- Add AGGREGATED_ATTESTATION decoding and dispatch in event sources
- Fix SyncService.publish_aggregated_attestation to use a callback instead of a missing method on NetworkRequester
- Wire publish callback directly in Node.from_genesis
- Publish aggregates from ChainService._initial_tick (was discarded)
- Enable test_late_joiner_sync with is_aggregator=True on node 0
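A minimal sketch of the callback shape this implies, assuming async signatures and a generic payload type; only the `_publish_agg_fn` attribute and the method name come from this PR:

```python
from typing import Awaitable, Callable, Optional


class SyncService:
    # Set by Node.from_genesis; injecting the callable avoids a circular
    # import between SyncService and the network layer.
    _publish_agg_fn: Optional[Callable[[object], Awaitable[None]]] = None

    async def publish_aggregated_attestation(self, aggregate: object) -> None:
        # Delegate publishing to whatever the node wired in; do nothing
        # if no network is attached (e.g. in unit tests).
        if self._publish_agg_fn is not None:
            await self._publish_agg_fn(aggregate)
```

Node.from_genesis then wires the callback with `sync_service._publish_agg_fn = network_service.publish_aggregated_attestation`, as quoted later in this thread.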
- Unpack the (store, aggregates) tuple from on_tick and aggregate_committee_signatures in the fork choice fill framework
- Update attestation target selection tests for the +1 safe_target allowance introduced in get_attestation_target
- Remove @pytest.mark.skip from test_mesh_finalization and test_mesh_2_2_2_finalization
- Update store attribute references: latest_new_attestations -> latest_new_aggregated_payloads, latest_known_attestations -> latest_known_aggregated_payloads
- test_partition_recovery remains xfail (known sync service limitation)
- Fix test_store_attestations indentation: steps 2-3 were inside the for loop, causing aggregation to clear signatures before the second validator check
- Fix store.py lint: remove blank line after docstring, wrap long line
- Fix type errors: peer_id accepts None for self-produced blocks throughout the sync pipeline (SyncService, HeadSync, BlockCache)
- Fix formatting in service.py, node_runner.py, test_multi_node.py
Replace fixed 70s duration loops with convergence helpers:
- assert_all_finalized_to: polls until the finalization target is reached
- assert_heads_consistent: polls until head slots converge
- assert_same_finalized_checkpoint: polls until nodes agree

This fixes CI flakiness where slow machines caused nodes to diverge during the fixed wait period. Tests now exit early on success and tolerate slower environments via generous timeouts.
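A plausible shape for the first helper, assuming nodes expose `finalized_slot` as in the test code quoted below; the default timeout and poll interval are illustrative:

```python
import asyncio
import time


async def assert_all_finalized_to(
    node_cluster, target_slot: int, timeout: float = 150.0, poll_interval: float = 2.0
) -> None:
    """Poll until every node has finalized at least target_slot, else fail."""
    start = time.monotonic()
    finalized: list[int] = []
    while time.monotonic() - start < timeout:
        finalized = [node.finalized_slot for node in node_cluster.nodes]
        if all(slot >= target_slot for slot in finalized):
            return  # exit early on success; fast machines stay fast
        await asyncio.sleep(poll_interval)
    raise AssertionError(
        f"Finalization did not reach slot {target_slot} within {timeout}s: {finalized}"
    )
```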
- Increase gossipsub mesh stabilization from 5s to 10s in start_all (CI machines need more time for mesh formation before block production)
- Increase finalization timeout from 100s to 150s
- Increase peer connection timeout from 15s to 30s
- Increase pytest timeout from 200s to 300s

The CI failure showed all 3 nodes stuck at finalized slot 0, indicating the gossip mesh wasn't fully formed when services started.
tcoratger left a comment
Just left a couple of comments, but it looks good to me in general. I just want to make sure that, both in the interop test and in the test vector with the slots, we are testing the same behaviour as before.
```diff
-    # Match Ream's 70 second test duration.
-    #
-    # Finalization requires sufficient time for:
-    # - Multiple slots to pass (4s each)
-    # - Attestations to accumulate
-    # - Justification and finalization to occur
-    run_duration = 70
-    poll_interval = 5
-
-    logger.info("Running chain for %d seconds (mesh_2_2_2)...", run_duration)
-
-    # Poll chain progress.
-    start = time.monotonic()
-    while time.monotonic() - start < run_duration:
-        slots = [node.head_slot for node in node_cluster.nodes]
-        finalized = [node.finalized_slot for node in node_cluster.nodes]
-        logger.info("Progress: head_slots=%s finalized=%s", slots, finalized)
-        await asyncio.sleep(poll_interval)
-
-    # Final state capture.
-    head_slots = [node.head_slot for node in node_cluster.nodes]
-    finalized_slots = [node.finalized_slot for node in node_cluster.nodes]
-
-    logger.info("FINAL: head_slots=%s finalized=%s", head_slots, finalized_slots)
-
-    # Same assertions as full mesh.
-    #
-    # Despite reduced connectivity (messages route through hub),
-    # the protocol should still achieve full consensus.
-
-    # Chain must advance sufficiently.
-    assert all(slot >= 5 for slot in head_slots), (
-        f"Chain did not advance enough. Head slots: {head_slots}"
-    )
-
-    # Heads must be consistent across nodes.
-    # Verify heads converged across nodes.
-    #
-    # Hub-and-spoke adds latency but should not cause divergence.
-    head_diff = max(head_slots) - min(head_slots)
-    assert head_diff <= 2, f"Head slots diverged too much. Slots: {head_slots}, diff: {head_diff}"
-
-    # ALL nodes must finalize.
-    assert all(slot > 0 for slot in finalized_slots), (
-        f"Not all nodes finalized. Finalized slots: {finalized_slots}"
-    )
+    # Wait for finalization with convergence-based polling.
+    #
+    # Hub-and-spoke adds latency (messages route through hub)
+    # but the protocol should still achieve finalization.
+    await assert_all_finalized_to(node_cluster, target_slot=1, timeout=150)
```
From the documentation, I guess you removed all this due to CI limitations?
Yes, these tests are flaky on CI. I tried switching the CI host machines to macOS M1, but they could still end up with different results every run (see here).

If we fix CI we could keep the old logic.
```
Initially, the safe target is at genesis (slot 0). The attestation target
is allowed up to 1 slot ahead of safe target to ensure the chain can
start advancing from genesis.
```
With this new condition, are we sure that the flow we are testing is the same as before?
This is probably not safe, as in that case we could vote for the new head as the target, which is not correct.

Let me think about a workaround.
Removed `+ Slot(1)` from `get_attestation_target`.
Remove all logger.debug() and logger.info() statements, plus the logging imports, that were added during development for debugging.
The previous +Slot(1) offset allowed the target to equal the head when the head was only 1 slot ahead of safe_target, conflating fork choice (head vote) with BFT finality (target vote). This violates the separation between the two mechanisms. With strict > safe_target.slot, the target never exceeds safe_target. Genesis bootstraps naturally: update_safe_target at interval 3 advances safe_target via 2/3+ head votes, so the target progresses by slot 2.
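An illustrative sketch of the strict walkback; the minimal `Block`/`Checkpoint` types are stand-ins for the real store structures, and only the strict `>` comparison mirrors the change:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    root: bytes
    slot: int


@dataclass(frozen=True)
class Block:
    root: bytes
    parent_root: bytes
    slot: int


def get_attestation_target(
    blocks: dict[bytes, Block], head: bytes, safe_target: Checkpoint
) -> Checkpoint:
    # Walk back from head while the block is strictly newer than the safe
    # target (no +Slot(1) allowance), so the target can never exceed
    # safe_target and never coincides with a head that is one slot ahead.
    block = blocks[head]
    while block.slot > safe_target.slot:
        block = blocks[block.parent_root]
    return Checkpoint(root=block.root, slot=block.slot)
```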
Restore the full docstrings for the accept_new_attestations and aggregate_committee_signatures methods to match origin/main. The implementations changed (tuple return types, aggregates collection), but these docstrings document the overall behavior, which remains the same. No functional change.
…back

The strict > safe_target.slot walkback (without the +Slot(1) offset) introduces a 1-slot delay in attestation target advancement. Targets lag by 1 slot compared to the old behavior, requiring more time for justification consensus. Increased the finalization timeout from 150s to 180s in both mesh finalization tests. This gives the protocol enough time (~50-60s vs 30s) to reach finalization on typical CI machines.
The [6, 6, 0] finalization failure (2 nodes finalized, 1 stuck at genesis) indicates a gossip mesh formation race condition on slow CI machines. Changes:
1. Increase the mesh stabilization delay from 10s to 15s for slower CI
2. Add an assert_heads_consistent check after peer connections to verify gossip subscriptions are active (not just peer connections)

This catches nodes that have peer connections but haven't completed gossip topic subscription handshakes, preventing isolated-node scenarios.
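A sketch of the readiness check in (2), in the same style as the finalization helper above; the tolerance and timeout defaults are illustrative:

```python
import asyncio
import time


async def assert_heads_consistent(
    node_cluster, max_diff: int = 2, timeout: float = 30.0, poll_interval: float = 1.0
) -> None:
    """Poll until head slots across all nodes are within max_diff of each other."""
    start = time.monotonic()
    heads: list[int] = []
    while time.monotonic() - start < timeout:
        heads = [node.head_slot for node in node_cluster.nodes]
        if max(heads) - min(heads) <= max_diff:
            return  # converging heads imply gossip subscriptions are live
        await asyncio.sleep(poll_interval)
    raise AssertionError(f"Head slots did not converge within {timeout}s: {heads}")
```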
unnawut left a comment
Just one main question; the rest are nitpicks. Thanks!
```python
# Publish any aggregated attestations produced during catch-up.
if new_aggregated_attestations:
    for agg in new_aggregated_attestations:
        await self.sync_service.publish_aggregated_attestation(agg)
```
If the validator is catching up, i.e. many slots behind the head slot, wouldn't this spam the network by publishing many stale aggregated_attestations returned by on_tick() for old slots?

I'm not sure if there's a scenario where a validator that is catching up can produce a useful aggregation, but correct me if I'm wrong.
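One possible guard for this, sketched with entirely hypothetical names; nothing here is from the PR:

```python
def should_publish_aggregate(aggregate_slot: int, current_slot: int, tolerance: int = 1) -> bool:
    """Skip aggregates for slots far behind the wall clock during catch-up.

    Stale aggregates add gossip load without helping peers that are already
    at the head, so only recent ones are worth publishing.
    """
    return aggregate_slot + tolerance >= current_slot
```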
```python
self,
attestation: SignedAttestation,
peer_id: PeerId,  # noqa: ARG002
subnet_id: int,
```
I see subnet_id is added but not used right now. Is there a plan to use it? If not, I'm thinking we could remove it just to simplify the specs.

This also applies to peer_id in on_gossip_aggregated_attestation() below.
```python
try:
    self.store = self.store.on_gossip_aggregated_attestation(signed_attestation)
except (AssertionError, KeyError):
    # Aggregation validation failed.
```
Can we log this at least as a warning? It'll be very hard to debug dropped aggregations otherwise. Thanks!
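A sketch of the suggested change, with a module-level logger assumed and the store call taken from the hunk above:

```python
import logging

logger = logging.getLogger(__name__)


def handle_gossip_aggregate(store, signed_attestation):
    """Apply an aggregated attestation, warning instead of silently dropping it."""
    try:
        return store.on_gossip_aggregated_attestation(signed_attestation)
    except (AssertionError, KeyError) as exc:
        # Surface the drop so failed aggregations can be debugged.
        logger.warning("Dropped aggregated attestation (validation failed): %s", exc)
        return store
```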
```python
# SyncService delegates aggregate publishing to NetworkService
# via a callback, avoiding a circular dependency.
sync_service._publish_agg_fn = network_service.publish_aggregated_attestation
```
Shouldn't this aggregation fn be registered with ValidatorService below rather than the sync service? By the current SyncService definition, I think publishing aggregations is quite far from its scope.

But if we want to decouple the aggregator from the rest of the validator role, we could also create an AggregatorService instead of bundling it into SyncService.

Just a suggestion; if it makes sense we can also take this as a separate GitHub issue.
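A sketch of the alternative being suggested, with all names hypothetical: inject the publish callback at construction of a dedicated service rather than patching SyncService afterwards:

```python
from typing import Awaitable, Callable


class AggregatorService:
    """Hypothetical service owning only aggregation concerns."""

    def __init__(self, publish_aggregate: Callable[[object], Awaitable[None]]) -> None:
        # Injected at construction, so no attribute patching is needed.
        self._publish_aggregate = publish_aggregate

    async def on_aggregate_ready(self, aggregate: object) -> None:
        await self._publish_aggregate(aggregate)
```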
We could remove these changes as they're not critical to the PR. But no big deal; we can avoid this in future big PRs.
🗒️ Description
🔗 Related Issues or PRs
✅ Checklist
- Run `tox` checks to avoid unnecessary CI fails: `uvx tox`