Fix race condition between job enqueue and concurrency unblock#712
Open
Fix race condition between job enqueue and concurrency unblock#712
Conversation
This addresses #456. There is a race condition in the concurrency control mechanism where a job that finishes and tries to unblock the next blocked execution can miss a `BlockedExecution` that is being created concurrently. This causes the blocked job to remain stuck until the `ConcurrencyMaintenance` periodic task runs (potentially minutes later). It happens as follows: 1. Job A is running (semaphore value=0) 2. Job B enqueue starts: reads semaphore (value=0, no row lock) → decides to block 3. Job A finishes: `Semaphore.signal` → `UPDATE` value to 1 (succeeds immediately since no lock held) 4. Job A: `BlockedExecution.release_one` → `SELECT` finds nothing (Job B's `BlockedExecution` not committed yet) 5. Job B enqueue commits: `BlockedExecution` now exists but nobody will unblock it The root cause is that `Semaphore::Proxy#wait` doesn't lock the semaphore row when checking the semaphore. This allows the concurrent `signal` to complete before the enqueue transaction commits, creating a window where the `BlockedExecution` is invisible. To fix, we lock the semaphore row with `FOR UPDATE` during the wait check so that the enqueue transaction holds the lock from the check through `BlockedExecution` creation and commit. This forces a concurrent signal `UPDATE` to wait, guaranteeing the `BlockedExecution` is visible when release_one runs. This shouldn't introduce any dead locks, as there's no new circular dependencies introduced by these two: - Enqueue path: locks `Semaphore` row → `INSERT`s `BlockedExecution` (no lock on existing rows) - `release_one` path: locks `BlockedExecution` row (`SKIP LOCKED`) → locks `Semaphore` row (via wait in release) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9a083fb to
b06f470
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This addresses #456.
There is a race condition in the concurrency control mechanism where a job that finishes and tries to unblock the next blocked execution can miss a
BlockedExecutionthat is being created concurrently. This causes the blocked job to remain stuck until theConcurrencyMaintenanceperiodic task runs (potentially minutes later).It happens as follows:
Job A is running (semaphore value=0)
Job B enqueue starts: reads semaphore (value=0, no row lock) → decides to block
Job A finishes:
Semaphore.signal→UPDATEvalue to 1 (succeeds immediately since no lock held)Job A:
BlockedExecution.release_one→SELECTfinds nothing (Job B'sBlockedExecutionnot committed yet)Job B enqueue commits:
BlockedExecutionnow exists but nobody will unblock itThe root cause is that
Semaphore::Proxy#waitdoesn't lock the semaphore row when checking the semaphore. This allows the concurrentsignalto complete before the enqueue transaction commits, creating a window where theBlockedExecutionis invisible.To fix, we lock the semaphore row with
FOR UPDATEduring the wait check so that the enqueue transaction holds the lock from the check throughBlockedExecutioncreation and commit. This forces a concurrent signalUPDATEto wait, guaranteeing theBlockedExecutionis visible when release_one runs.This shouldn't introduce any dead locks, as there's no new circular dependencies introduced by these two:
Semaphorerow →INSERTsBlockedExecution(no lock on existing rows)release_onepath: locksBlockedExecutionrow (SKIP LOCKED) → locksSemaphorerow (via wait in release)