
Fix bugs in table upload #217

Open
fcollman wants to merge 29 commits into master from fix_upload

Conversation


@fcollman fcollman commented Mar 4, 2026

Trying to find and fix bugs in the CSV upload feature.

fcollman and others added 29 commits March 4, 2026 06:38
The inline vectorized spatial processing in process_chunk called
.astype(float) on coordinate columns, which fails when coordinates
are provided in combined "[x,y,z]" string format. Replace with calls
to the existing process_spatial_point method, which already handles
both string and separate-column formats. Also fixes missing-coordinate
fallback to use None instead of empty string.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
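
A minimal sketch of the parsing this commit describes (the name `process_spatial_point` comes from the commit message; the column-naming convention and return shape here are assumptions for illustration):

```python
import ast

def process_spatial_point(row, col):
    """Parse a coordinate given either as a combined "[x,y,z]" string in
    column `col`, or as separate `col`_x / `col`_y / `col`_z columns.
    Returns a list of floats, or None when the coordinate is missing."""
    combined = row.get(col)
    if combined not in (None, ""):
        if isinstance(combined, str):
            # "[x,y,z]" string form: a bare .astype(float) would fail
            # here, so parse the bracketed literal instead.
            return [float(v) for v in ast.literal_eval(combined)]
        return [float(v) for v in combined]
    parts = [row.get(f"{col}_{axis}") for axis in ("x", "y", "z")]
    if any(p in (None, "") for p in parts):
        return None  # missing coordinate falls back to None, not ""
    return [float(p) for p in parts]
```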
Columns present in the DB model but absent from the CSV were
silently filled with empty string "". This caused silent data
corruption for nullable columns (which expect NULL) and hid
constraint violations for non-nullable columns. Use None so
nullable columns get a proper NULL and non-nullable columns
surface a clear DB error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
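
The fix amounts to defaulting absent columns to None rather than "". A hypothetical sketch (function name invented for illustration):

```python
def fill_missing_columns(row, model_columns):
    # Columns present in the DB model but absent from the CSV row get
    # None (SQL NULL) instead of "": nullable columns store a real NULL,
    # and NOT NULL columns surface a clear constraint error.
    return {col: row.get(col) for col in model_columns}  # .get defaults to None
```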
…/empty

fillna(0), fillna(0.0), and fillna(False) silently substituted 0 or
False for missing CSV values, corrupting data for fields where those
values have semantic meaning. Use pandas nullable dtypes (Int64,
boolean) and remove fillna so missing values remain NA and serialize
to empty CSV cells, which PostgreSQL imports as NULL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
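
A small demonstration of the nullable-dtype behavior this commit relies on (the column names are illustrative, not from the codebase):

```python
import io

import pandas as pd

# With nullable dtypes and no fillna, a missing CSV cell stays <NA>
# instead of being coerced to 0 or False.
csv = "id,count,flag\n1,5,True\n2,,\n"
df = pd.read_csv(io.StringIO(csv), dtype={"count": "Int64", "flag": "boolean"})

# <NA> serializes back to an empty cell, which PostgreSQL imports as NULL.
round_trip = df.to_csv(index=False)
```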
…ries

_id_counter was always initialized to 0, so a retry (processor
re-instantiation) would regenerate IDs starting from 1 and collide
with rows already written in the staging table. The new
id_counter_start parameter lets callers pass the current max ID
from the staging table when resuming, making restarts safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
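
The resumable counter can be sketched as follows (a minimal stand-in for the real `SchemaProcessor`, whose other responsibilities are omitted):

```python
class SchemaProcessor:
    def __init__(self, id_counter_start=0):
        # On a retry, callers pass the max ID already written to the
        # staging table, so generated IDs continue from there instead
        # of restarting at 1 and colliding with existing rows.
        self._id_counter = id_counter_start

    def next_id(self):
        self._id_counter += 1
        return self._id_counter
```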
Add id_counter_start parameter to the process_csv Celery task and
pass it through to SchemaProcessor. After processing completes,
record last_assigned_id in both the return dict and Redis job status.
This lets callers (e.g. continuation batch uploads) pass the previous
run's final ID so generated IDs don't overlap with existing rows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
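
A simplified sketch of the pass-through, with a plain dict standing in for the Redis job status and a row count standing in for the real per-row ID generation:

```python
def process_csv(rows, id_counter_start=0, job_status=None):
    last_id = id_counter_start
    for _ in rows:
        last_id += 1  # stand-in for SchemaProcessor ID assignment
    result = {"rows_processed": len(rows), "last_assigned_id": last_id}
    if job_status is not None:
        # Recorded so a continuation batch upload can resume from here.
        job_status["last_assigned_id"] = last_id
    return result
```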
When psql times out or fails, Popen.__exit__ closes dump_proc.stdout
and calls wait(). If pg_dump has suppressed SIGPIPE (libpq does this),
pg_dump stays alive and wait() blocks forever. Fix by catching any
exception inside the with-block, explicitly closing stdout and sending
SIGKILL to pg_dump before the exception reaches __exit__, so wait()
returns immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
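
The shape of the fix, sketched with generic commands in place of the real pg_dump/psql invocations:

```python
import signal
import subprocess

def run_pipeline(dump_cmd, load_cmd, timeout):
    """Pipe dump_cmd into load_cmd without deadlocking Popen.__exit__
    when the loader fails while the dumper ignores SIGPIPE."""
    with subprocess.Popen(dump_cmd, stdout=subprocess.PIPE) as dump_proc:
        try:
            subprocess.run(load_cmd, stdin=dump_proc.stdout,
                           timeout=timeout, check=True)
        except Exception:
            # Close our end of the pipe and kill the dumper before
            # __exit__ calls wait(), so wait() returns immediately.
            dump_proc.stdout.close()
            dump_proc.send_signal(signal.SIGKILL)
            raise
```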
raise ValueError(...) was discarding the original exception context,
making it harder to diagnose the root cause of GCS upload failures.
Add 'from e' so the full traceback chain is visible in logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
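
The pattern, with a dummy blob standing in for the GCS client object:

```python
def upload_to_gcs(blob, payload):
    try:
        blob.upload_from_string(payload)
    except Exception as e:
        # 'from e' chains the original exception, so logs show the real
        # root cause (credentials, network, ...) rather than only the
        # wrapper ValueError.
        raise ValueError(f"GCS upload failed: {e}") from e
```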
Request parsing and storage.Client() initialization were outside the
try/except block, causing unhandled exceptions and opaque 500s when:
- request body is missing or not JSON (TypeError on data["filename"])
- "filename"/"contentType" keys are absent (KeyError)
- GCS credentials are not configured (DefaultCredentialsError)
- MATERIALIZATION_UPLOAD_BUCKET_PATH is not set

Move all logic inside the try, add an explicit null check on the
request body and the bucket config, and add a KeyError handler to
return a 400 for missing fields instead of a 500.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
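
A framework-agnostic sketch of the hardened handler: each failure mode listed above maps to an explicit (body, status) response. The function name and response shape are illustrative; the actual GCS signed-URL generation is elided:

```python
def generate_upload_url(data, bucket_path):
    try:
        if data is None:
            return {"error": "request body must be JSON"}, 400
        filename = data["filename"]        # KeyError -> 400 below
        content_type = data["contentType"]
        if not bucket_path:
            return {"error": "MATERIALIZATION_UPLOAD_BUCKET_PATH not set"}, 500
        # ... storage.Client() / signed-URL generation would go here ...
        return {"filename": filename, "contentType": content_type}, 200
    except KeyError as e:
        return {"error": f"missing required field: {e.args[0]}"}, 400
    except Exception as e:
        return {"error": str(e)}, 500
```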
The JS read the error response body and logged it to the console but
then threw a new Error containing only the HTTP status code, so the
actual server message never reached the UI. Parse the JSON response
and include the message field in the thrown error so users see a
meaningful description instead of just "Failed to get upload URL: 500".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a Celery chord callback (finalize_chunk_outcome) is lost due to a
worker crash or warm shutdown during deployment, the affected chunk
remains in PROCESSING_SUBTASKS indefinitely. get_chunks_to_process
ignores that status, so the dispatcher polls forever with no recovery.

Track entry into PROCESSING_SUBTASKS with a per-chunk timestamp in a
new Redis hash (processing_subtasks_timestamps). Add
recover_stale_processing_subtasks() to CheckpointManager, which scans
for chunks stuck beyond a threshold (default 10 min) and moves them to
FAILED_RETRYABLE so the dispatcher can re-dispatch them. The dispatcher
calls this method each time it finds no chunks to dispatch, retrying
immediately if any are recovered.
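
The recovery logic can be sketched with plain dicts standing in for the Redis hashes (the real `CheckpointManager` has many more responsibilities; only the stale-chunk path is shown):

```python
import time

class CheckpointManager:
    def __init__(self, stale_after=600):  # default threshold: 10 min
        self.stale_after = stale_after
        self.status = {}                   # stand-in for chunk-status hash
        self.processing_timestamps = {}    # processing_subtasks_timestamps

    def mark_processing(self, chunk_id, now=None):
        self.status[chunk_id] = "PROCESSING_SUBTASKS"
        self.processing_timestamps[chunk_id] = now if now is not None else time.time()

    def recover_stale_processing_subtasks(self, now=None):
        """Move chunks stuck in PROCESSING_SUBTASKS past the threshold
        to FAILED_RETRYABLE so the dispatcher re-dispatches them."""
        now = now if now is not None else time.time()
        recovered = []
        for chunk_id, started in list(self.processing_timestamps.items()):
            if now - started > self.stale_after:
                self.status[chunk_id] = "FAILED_RETRYABLE"
                del self.processing_timestamps[chunk_id]
                recovered.append(chunk_id)
        return recovered
```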
