Skip to content

fix AssertError for newer but lower message idx#194

Open
snubba wants to merge 1 commit intobakwc:masterfrom
snubba:fix_assert_error_in_apply_command
Open

fix AssertError for newer but lower message idx#194
snubba wants to merge 1 commit intobakwc:masterfrom
snubba:fix_assert_error_in_apply_command

Conversation

@snubba
Copy link
Contributor

@snubba snubba commented Feb 23, 2026

line 975-978

idx = message['log_idx']
term = message['log_term']
assert idx > self.__raftLastApplied          # ← crashes here
self.__commandsWaitingCommit[idx].append((term, callback))

The Race Condition

Key triggering scenario — leader reconnects after restart with stale socket data:

  1. Follower A sends apply_command to Leader B → commandsWaitingReply[N] = callback
  2. Leader B logs the command at log_idx=X, sends apply_command_response (message in OS socket buffer, not yet read by A)
  3. Leader B restarts → new TCP connection established from B → A
  4. In _onIncomingMessageReceived (transport.py:368), the new connection replaces the old one: self._connections[node] = new_conn, but the old socket stays registered with the poller
  5. The new connection delivers fresh append_entries with a snapshot → __loadDumpFile(clearJournal=True) → __raftLastApplied jumps past X
  6. Crucially: because the reconnecting node has the same address (same Node.id), line 886 evaluates self.__raftLeader != node as False__onLeaderChanged() is never called, so commandsWaitingReply[N] is not cleared
  7. The old socket then delivers its buffered apply_command_response with log_idx=X
  8. Callback is found in commandsWaitingReplyX > __raftLastAppliedFalseAssertionError

Secondary scenario — tick ordering:

The tick order is:
__applyLogEntries()__sendAppendEntries()_checkCommandsToApply()_poller.poll()
If __applyLogEntries() applies entry at idx=X (advancing __raftLastApplied to X) in the same tick that _poller.poll() delivers an apply_command_response with log_idx=X, you get X > X → False.

Why it "doesn't go away"

AssertionError is a subclass of Exception. The _autoTickThread has a blanket except Exception at line 539 that catches it, logs it, and continues running. The popped callback is silently dropped (never called, never put in commandsWaitingCommit). If the application retries the command under the same cluster instability, it can reproduce the same race.

@bakwc Can you please confirm this fix? It is hard to reproduce reliable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant