
Fix race condition between job enqueue and concurrency unblock#712

Open
rosa wants to merge 1 commit into main from fix-race-condition-in-concurrency-controls

Conversation


rosa (Member) commented on Feb 13, 2026

This addresses #456.

There is a race condition in the concurrency control mechanism where a job that finishes and tries to unblock the next blocked execution can miss a `BlockedExecution` that is being created concurrently. This causes the blocked job to remain stuck until the `ConcurrencyMaintenance` periodic task runs (potentially minutes later).

It happens as follows (a toy reproduction of the interleaving follows the list):

  1. Job A is running (semaphore value=0)

  2. Job B enqueue starts: reads semaphore (value=0, no row lock) → decides to block

  3. Job A finishes: `Semaphore.signal` → `UPDATE` value to 1 (succeeds immediately since no lock is held)

  4. Job A: `BlockedExecution.release_one` → `SELECT` finds nothing (Job B's `BlockedExecution` not committed yet)

  5. Job B enqueue commits: `BlockedExecution` now exists but nobody will unblock it
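
The window is easiest to see in isolation. Below is a self-contained toy model of the interleaving in plain Ruby threads, with a Hash standing in for the semaphore row and an Array for the blocked-executions table; all names are illustrative and the sleeps just force the ordering above, so none of this is Solid Queue code:

```ruby
semaphore = { value: 0 }   # stands in for the semaphore row
blocked   = []             # stands in for the blocked-executions table
list_lock = Mutex.new      # guards the blocked list only, not the check

# Job B's enqueue path: unlocked read, then a delayed "commit" of its
# blocked record (steps 2 and 5 above).
enqueue = Thread.new do
  if semaphore[:value] == 0                        # step 2: no row lock held
    sleep 0.2                                      # the pre-commit window
    list_lock.synchronize { blocked << "job B" }   # step 5: visible too late
  end
end

# Job A's completion path: signal, then scan for someone to unblock
# (steps 3 and 4 above).
finish = Thread.new do
  sleep 0.1
  semaphore[:value] = 1                               # step 3: UPDATE, nothing blocks it
  released = list_lock.synchronize { blocked.shift }  # step 4: finds nothing
  puts "released: #{released.inspect}"                # => released: nil
end

[enqueue, finish].each(&:join)
puts "stuck: #{blocked.inspect}"                      # => stuck: ["job B"]
```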

The root cause is that `Semaphore::Proxy#wait` doesn't lock the semaphore row when checking the semaphore. This allows the concurrent `signal` to complete before the enqueue transaction commits, creating a window where the `BlockedExecution` is invisible.
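
In ActiveRecord terms, the problematic shape is roughly this (a simplified sketch, not the actual Solid Queue source; `concurrency_key` stands in for whatever key the proxy resolves):

```ruby
# Simplified sketch of the unlocked check (not the actual source).
# find_by issues a plain, non-locking SELECT, so a concurrent
# Semaphore.signal can run its UPDATE and release_one scan anywhere in
# the gap between this read and the enqueue transaction's COMMIT.
def wait
  semaphore = SolidQueue::Semaphore.find_by(key: concurrency_key)
  semaphore.nil? || semaphore.value > 0   # true: proceed; false: caller blocks
end
```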

To fix, we lock the semaphore row with `FOR UPDATE` during the wait check, so that the enqueue transaction holds the lock from the check through `BlockedExecution` creation and commit. This forces a concurrent `signal` `UPDATE` to wait, guaranteeing the `BlockedExecution` is visible when `release_one` runs.
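
A sketch of the fix under the same simplifications, using ActiveRecord's standard `lock` relation method to emit `SELECT ... FOR UPDATE`; because `wait` runs inside the enqueue transaction, the row lock taken here lives until that transaction commits:

```ruby
# Simplified sketch of the locked check. The FOR UPDATE lock is held through
# the BlockedExecution INSERT and the COMMIT, so a concurrent
# Semaphore.signal's UPDATE queues up behind this transaction, and its
# release_one scan then sees the freshly committed BlockedExecution.
def wait
  semaphore = SolidQueue::Semaphore.lock.find_by(key: concurrency_key)  # SELECT ... FOR UPDATE
  semaphore.nil? || semaphore.value > 0   # same decision, now under the row lock
end
```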

This shouldn't introduce any deadlocks, as no new circular dependency is created between these two paths (the release path is sketched after the list):

  • Enqueue path: locks `Semaphore` row → `INSERT`s `BlockedExecution` (no lock on existing rows)
  • `release_one` path: locks `BlockedExecution` row (`SKIP LOCKED`) → locks `Semaphore` row (via `wait` in `release`)
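
For reference, the release path's lock order under the same assumptions (the query shape is illustrative; `release` re-running `wait` is the step named in the list above):

```ruby
# Simplified sketch of the release path. SKIP LOCKED means this SELECT never
# waits on a blocked-execution row another transaction holds, and the
# Semaphore row lock (taken inside release → wait) is the last lock either
# path acquires, so no circular wait is possible.
def release_one(concurrency_key)
  SolidQueue::BlockedExecution.transaction do
    blocked = SolidQueue::BlockedExecution
      .where(concurrency_key: concurrency_key)
      .lock("FOR UPDATE SKIP LOCKED")
      .first
    blocked&.release   # re-checks the semaphore via wait before promoting the job
  end
end
```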

rosa force-pushed the fix-race-condition-in-concurrency-controls branch from 9a083fb to b06f470 on February 13, 2026 at 12:17