Bug Report
Description
When using populate(reserve_jobs=True) with multiple workers (e.g., SLURM array jobs), several workers can successfully "reserve" the same key and call make() simultaneously.
I noticed this when submitting multiple SLURM jobs that populate the same table: their output logs confirmed that 6 of 7 jobs were executing make() on the same key. The job reservation system has always been a nice feature of DataJoint for distributed computing, preventing redundant computations and collisions.
Current workaround
Currently I think this is a rare occurrence, probably made slightly more common by the fact that I am accessing my server across a network (not hosted locally) and submitting sbatch arrays. My workaround right now has been adding a random sleep (0-30 s) before populate() in the SLURM script to stagger worker start times; a minimal sketch follows below. With that in place I have not noticed the duplicate make() / race.
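For reference, this is roughly what the staggered worker script does; the pipeline module and table name here are placeholders, not the actual code:

```python
import random
import time

from my_pipeline import MyComputedTable  # placeholder import for the actual pipeline

# Stagger worker start times by 0-30 seconds so that concurrent SLURM array tasks
# are less likely to read the same pending job row at the same instant.
time.sleep(random.uniform(0, 30))

MyComputedTable.populate(reserve_jobs=True)
```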
I'm not very familiar with how DataJoint manages job reservations, but below is what Claude suggested might be the issue. I'm adding it to this post in case it is useful.
Job.reserve() in DataJoint 2.1 uses a non-atomic SELECT-then-UPDATE pattern that allows multiple workers to reserve the same job simultaneously. This is a regression from the 0.13.x approach, which used an INSERT ... ON DUPLICATE KEY pattern that was inherently atomic.

Root Cause
Job.reserve() (jobs.py:430-473) performs a check-then-act without atomicity:

```python
def reserve(self, key: dict) -> bool:
    # Step 1: SELECT — check if job is pending
    job = (
        self & key & "status='pending'" & "scheduled_time <= CURRENT_TIMESTAMP(3)"
    ).to_dicts()
    if not job:
        return False
    # Step 2: UPDATE — mark as reserved
    pk = self._get_pk(key)
    update_row = {**pk, "status": "reserved", ...}
    try:
        self.update1(update_row)  # UPDATE ... SET status='reserved' WHERE <pk>
        return True
    except Exception:
        return False
```

The UPDATE's WHERE clause matches only on the primary key, not on status='pending'. So if two workers both read the row as 'pending' before either updates, both UPDATEs succeed; the second simply overwrites the first worker's reservation.
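To make the race window concrete, here is a small toy simulation (plain Python, not DataJoint code) of two workers interleaving their SELECT and UPDATE steps; the barrier forces the unlucky timing that concurrent SLURM array tasks can hit by chance:

```python
import threading

# Toy in-memory stand-in for a single jobs-table row.
row = {"status": "pending"}
barrier = threading.Barrier(2)  # force both workers past the SELECT before either UPDATEs
results = {}

def reserve(worker: str) -> None:
    is_pending = row["status"] == "pending"  # Step 1: SELECT sees 'pending'
    barrier.wait()                           # both workers have now passed the check
    if is_pending:
        row["status"] = "reserved"           # Step 2: UPDATE keyed only on the primary key
        results[worker] = True               # both workers believe they hold the reservation
    else:
        results[worker] = False

threads = [threading.Thread(target=reserve, args=(w,)) for w in ("worker_A", "worker_B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # {'worker_A': True, 'worker_B': True}: both would proceed to call make()
```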
Comparison with 0.13.x

The old JobTable.reserve() (0.13.x) used an atomic INSERT pattern:

```python
def reserve(self, table_name, key):
    job = dict(key, table_name=table_name, status='reserved',
               host=platform.node(), pid=os.getpid(), ...)
    try:
        self.insert1(job)  # INSERT — fails with DuplicateError if row exists
    except DuplicateError:
        return False  # Another worker already has this key
    return True
```

This is inherently atomic: the first INSERT wins, all others get DuplicateError. No window exists between check and action.

Suggested Fix
Option A — Add a WHERE clause to the UPDATE:

```sql
UPDATE jobs SET status='reserved', ...
WHERE table_name=... AND key_hash=... AND status='pending'
```

Then check affected_rows == 1 to determine success. This is a single atomic operation.
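As a rough sketch of what Option A could look like inside reserve(): the self.connection.query call, self.full_table_name, and the table_name/key_hash columns are assumptions modeled on the 0.13.x jobs table, not verified 2.1 internals:

```python
def reserve(self, key: dict) -> bool:
    # One atomic statement: the status='pending' predicate means at most one
    # concurrent worker can flip the row to 'reserved'.
    pk = self._get_pk(key)
    cursor = self.connection.query(
        f"UPDATE {self.full_table_name} "
        "SET status='reserved' "  # plus host/pid/timestamp bookkeeping, elided here
        "WHERE table_name=%s AND key_hash=%s AND status='pending'",
        args=(pk["table_name"], pk["key_hash"]),
    )
    return cursor.rowcount == 1  # 1: we won the row; 0: another worker reserved it first
```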
Option B — Use SELECT ... FOR UPDATE before the check:

```sql
SELECT * FROM jobs WHERE ... AND status='pending' FOR UPDATE
```

This acquires a row-level lock (it must run inside a transaction), so a second worker's locking read blocks until the first worker commits and then no longer sees the row as pending.
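And a similar sketch of Option B; again, connection.transaction, connection.query, and the column names are assumptions borrowed from older DataJoint versions:

```python
def reserve(self, key: dict) -> bool:
    pk = self._get_pk(key)
    with self.connection.transaction:  # FOR UPDATE only holds its lock inside a transaction
        rows = self.connection.query(
            f"SELECT * FROM {self.full_table_name} "
            "WHERE table_name=%s AND key_hash=%s AND status='pending' FOR UPDATE",
            args=(pk["table_name"], pk["key_hash"]),
        ).fetchall()
        if not rows:
            return False  # another worker already reserved (or completed) this key
        # The pending row is now locked until commit, so no other worker can pass the check.
        self.connection.query(
            f"UPDATE {self.full_table_name} SET status='reserved' "
            "WHERE table_name=%s AND key_hash=%s",
            args=(pk["table_name"], pk["key_hash"]),
        )
    return True
```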
Option C — Restore the INSERT-based approach from 0.13.x, which was atomic by design.
Environment
- datajoint 2.1.0
- Python 3.12
- MySQL 8.0
- SLURM cluster, 7 concurrent array tasks calling populate(reserve_jobs=True)