
Bug: multiple workers reserving the same key in Job.reserve() in 2.1 #1398

@noahpettit

Bug Report

Description

When using populate(reserve_jobs=True) with multiple workers (e.g., SLURM array jobs), multiple workers can successfully "reserve" the same key and call make() simultaneously.
I noticed this when submitting multiple SLURM jobs that populate the same table: their output logs confirmed that 6 of 7 jobs were executing make() on the same key. The job reservation system has always been a valuable feature of DataJoint for distributed computing, preventing redundant computation and collisions.

Current workaround

I think this is currently a rare occurrence, probably made somewhat more likely by the fact that I access my server across a network (it is not hosted locally) and submit sbatch arrays. My workaround has been to add a random sleep (0-30 s) before populate() in the SLURM script to stagger worker start times, as in the sketch below. With that in place I have not noticed the duplicate make() race.
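A minimal sketch of that workaround, assuming MyComputedTable stands in for the autopopulated table (the import path is hypothetical):

import random
import time

from my_pipeline import MyComputedTable  # hypothetical import path

# Stagger worker start times by 0-30 s so that concurrent SLURM array
# tasks are unlikely to hit Job.reserve() at the same instant.
time.sleep(random.uniform(0, 30))

MyComputedTable.populate(reserve_jobs=True)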

I'm not very familiar with how DataJoint manages job reservations, but below is what Claude suggested might be the issue. I'm adding it to this post in case it is useful.

Job.reserve() in DataJoint 2.1 uses a non-atomic SELECT-then-UPDATE pattern that allows multiple workers to reserve the same job simultaneously. This is a regression from the 0.13.x approach, which relied on an INSERT that fails on a duplicate key and was therefore inherently atomic.

Root Cause

Job.reserve() (jobs.py:430-473) performs a check-then-act without atomicity:

def reserve(self, key: dict) -> bool:
    # Step 1: SELECT — check if job is pending
    job = (self & key & "status='pending'" & "scheduled_time <= CURRENT_TIMESTAMP(3)").to_dicts()
    if not job:
        return False

    # Step 2: UPDATE — mark as reserved
    pk = self._get_pk(key)
    update_row = {**pk, "status": "reserved", ...}
    try:
        self.update1(update_row)  # UPDATE ... SET status='reserved' WHERE <pk>
        return True
    except Exception:
        return False

The UPDATE's WHERE clause matches only on the primary key, not on status='pending'. So if two workers both read the row as 'pending' before either updates it, both UPDATEs succeed; the second simply overwrites the first worker's reservation.
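The interleaving is easy to reproduce without any database. Here is a toy in-memory sketch of the same check-then-act race, with threads and a dict standing in for workers and the jobs table:

import threading
import time

# Toy in-memory "jobs table": one row whose status starts as 'pending'.
job = {"status": "pending", "reserved_by": None}
winners = []

def reserve(worker_id):
    # Step 1: SELECT -- every worker can observe status == 'pending'
    if job["status"] != "pending":
        return
    time.sleep(0.01)  # widen the window between check and act
    # Step 2: UPDATE -- matches only the "primary key", so later workers
    # silently overwrite the earlier reservation.
    job["status"] = "reserved"
    job["reserved_by"] = worker_id
    winners.append(worker_id)

threads = [threading.Thread(target=reserve, args=(i,)) for i in range(7)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Typically prints more than one winner, mirroring the 6 of 7 SLURM
# jobs that all executed make() on the same key.
print(f"{len(winners)} workers 'won' the reservation: {winners}")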

Comparison with 0.13.x

The old JobTable.reserve() (0.13.x) used an atomic INSERT pattern:

def reserve(self, table_name, key):
    job = dict(key, table_name=table_name, status='reserved',
               host=platform.node(), pid=os.getpid(), ...)
    try:
        self.insert1(job)  # INSERT — fails with DuplicateError if row exists
    except DuplicateError:
        return False        # Another worker already has this key
    return True

This is inherently atomic: the first INSERT wins, and every other worker gets DuplicateError. No window exists between check and action.
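The first-INSERT-wins behavior can be sketched with any database that enforces primary keys; here sqlite3 stands in for MySQL, and IntegrityError for DataJoint's DuplicateError:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (key_hash TEXT PRIMARY KEY, status TEXT)")

def reserve(key_hash):
    try:
        # The INSERT itself is the reservation: the primary key
        # constraint guarantees first-writer-wins.
        conn.execute(
            "INSERT INTO jobs (key_hash, status) VALUES (?, 'reserved')",
            (key_hash,),
        )
        return True   # our INSERT won
    except sqlite3.IntegrityError:  # SQLite's analogue of DuplicateError
        return False  # another worker already holds this key

print(reserve("abc123"))  # True  -- reservation acquired
print(reserve("abc123"))  # False -- duplicate key, reservation refused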

Suggested Fix

Option A — Add a WHERE clause to the UPDATE:

UPDATE jobs SET status='reserved', ...
WHERE table_name=... AND key_hash=... AND status='pending'

Then check affected_rows == 1 to determine success. This is a single atomic operation.
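A sketch of Option A, using sqlite3 only to demonstrate the affected-rows check (with MySQL the same check is cursor.rowcount after the UPDATE); the table and column names are simplified:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (key_hash TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs VALUES ('abc123', 'pending')")

def reserve(key_hash):
    # One atomic statement: the status check lives inside the UPDATE's
    # WHERE clause, so only one worker can flip 'pending' -> 'reserved'.
    cur = conn.execute(
        "UPDATE jobs SET status='reserved' "
        "WHERE key_hash=? AND status='pending'",
        (key_hash,),
    )
    return cur.rowcount == 1  # affected_rows == 1 means we won

print(reserve("abc123"))  # True  -- flipped pending -> reserved
print(reserve("abc123"))  # False -- row was no longer pending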

Option B — Use SELECT ... FOR UPDATE before the check:

SELECT * FROM jobs WHERE ... AND status='pending' FOR UPDATE

This acquires a row-level lock inside a transaction: a concurrent worker's locking read blocks until the first transaction commits, and then no longer finds the row 'pending'.
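A sketch of Option B using pymysql; the connection parameters and the single-column jobs schema are placeholders:

import pymysql

conn = pymysql.connect(host="localhost", user="dj_user",
                       password="...", database="pipeline")

def reserve(key_hash):
    try:
        conn.begin()  # FOR UPDATE only takes locks inside a transaction
        with conn.cursor() as cur:
            # Row-level lock: a concurrent worker's locking read blocks
            # here until we commit, then finds no 'pending' row.
            cur.execute(
                "SELECT 1 FROM jobs "
                "WHERE key_hash=%s AND status='pending' FOR UPDATE",
                (key_hash,),
            )
            if cur.fetchone() is None:
                conn.rollback()
                return False
            cur.execute(
                "UPDATE jobs SET status='reserved' WHERE key_hash=%s",
                (key_hash,),
            )
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        return False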

Option C — Restore the INSERT-based approach from 0.13.x, which was atomic by design.

Environment

  • datajoint 2.1.0
  • Python 3.12
  • MySQL 8.0
  • SLURM cluster, 7 concurrent array tasks calling populate(reserve_jobs=True)
