
Connection error handling#379

Open
tobyhede wants to merge 7 commits into main from connection-error-handling

Conversation

Contributor

@tobyhede tobyhede commented Mar 10, 2026

Acknowledgment

By submitting this pull request, I confirm that CipherStash can use, modify, copy, and redistribute this contribution, under the terms of CipherStash's choice.

Summary by CodeRabbit

  • New Features

    • Connection timeout protection added for database connections (default 120s).
  • Improvements

    • Timeout errors now report durations in milliseconds and are sent to clients promptly.
    • Improved handling to flush timeout responses and recover more gracefully.
  • Tests

    • New integration tests covering connection resilience and concurrent connection scenarios.
  • Documentation

    • Clarified build/test workflow: proxy image must be built locally; tests avoid pulling released images.
  • Chores

    • Local Docker image tagging updated; dependency patch bumped.

…r reporting

- Add configurable connection_timeout (default 120s, 0 to disable) to DatabaseConfig
- Fix ConnectionTimeout error message to display milliseconds (was showing seconds)
- Map ConnectionTimeout to PostgreSQL error code 57P05 instead of generic 58000
- Send ErrorResponse to client on timeout via channel writer
- Add pull_policy: never to docker-compose proxy services to ensure local builds
- Add connect_timeout to integration test client config
- Add unit tests for config defaults, error message format, and error code mapping
- Add integration tests for connection isolation under load
- Document local build requirement in DEVELOPMENT.md
- Pass actual timeout duration in ErrorResponse message to client
- Document rationale for using 57P05 over 08006 error code
- Document yield_now() as best-effort flush with known limitation
- Document advisory lock test race condition timing margin
- Replace magic number 12345 with named ADVISORY_LOCK_ID constant
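The timeout semantics described above (default 120s, 0 to disable) can be sketched as follows. This is a minimal illustration of the behavior the bullet points describe; the constant and function names mirror the PR summary, but the actual `DatabaseConfig` API may differ:

```rust
use std::time::Duration;

// Default from the PR summary: 120 seconds, expressed in milliseconds.
const DEFAULT_CONNECTION_TIMEOUT_MS: u64 = 120_000;

// Hypothetical accessor: `None` falls back to the default,
// an explicit 0 disables the timeout entirely.
fn connection_timeout(configured_ms: Option<u64>) -> Option<Duration> {
    match configured_ms {
        Some(0) => None,
        Some(ms) => Some(Duration::from_millis(ms)),
        None => Some(Duration::from_millis(DEFAULT_CONNECTION_TIMEOUT_MS)),
    }
}

fn main() {
    assert_eq!(connection_timeout(None), Some(Duration::from_millis(120_000)));
    assert_eq!(connection_timeout(Some(0)), None);
    assert_eq!(connection_timeout(Some(5_000)), Some(Duration::from_secs(5)));
}
```

Treating `Some(0)` as "disabled" rather than "time out immediately" matches the commit note's "0 to disable" convention.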

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Adds connection-timeout support (default 120s; 0 disables) and surfaces timeouts as PostgreSQL 57P05 with millisecond messages; updates proxy handler to send timeout responses; adds connection-resilience integration tests; documents requirement to build proxy image locally and tags local Docker build as cipherstash/proxy:latest.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Docs & CI** — `DEVELOPMENT.md`, `CLAUDE.md`, `mise.toml`, `tests/docker-compose.yml` | Document that the proxy image must be built locally for tests, add a macOS ulimit note, tag the docker build as `cipherstash/proxy:latest`, add a conditional docker build progress flag, and set `pull_policy: never` in the test compose file. |
| **Database config** — `packages/cipherstash-proxy/src/config/database.rs` | Add `DEFAULT_CONNECTION_TIMEOUT_MS` (120_000); `connection_timeout()` returns `None` for `Some(0)`, the default when `None`; unit tests added. |
| **Error formatting & mapping** — `packages/cipherstash-proxy/src/error.rs`, `packages/cipherstash-proxy/src/postgresql/messages/error_response.rs`, `packages/cipherstash-proxy/src/postgresql/error_handler.rs` | `ConnectionTimeout` display uses milliseconds; `ErrorResponse::connection_timeout` now accepts a message string; map `ConnectionTimeout` to PostgreSQL 57P05 and add tests. |
| **Proxy handler timeout flow** — `packages/cipherstash-proxy/src/postgresql/handler.rs` | Intercept `ConnectionTimeout` from IO reads, send a best-effort connection timeout `ErrorResponse` (flush/yield), terminate early; add private `send_timeout_error` helper. |
| **Integration tests** — `packages/cipherstash-proxy-integration/src/common.rs`, `packages/cipherstash-proxy-integration/src/connection_resilience.rs`, `packages/cipherstash-proxy-integration/src/lib.rs` | Expose `get_database_port()` as pub, set a 10s connect timeout in tests, add a `connection_resilience` module with 4 async tests covering slow queries, reconnects, concurrent load, and advisory-lock behavior; module declared in lib. |
| **Dependency & manifest** — `packages/cipherstash-proxy/Cargo.toml`, `Cargo.toml` | Bump `metrics` dependency from 0.24.1 to 0.24.3; `Cargo.toml` line changes included. |
| **Tests compose** — `tests/docker-compose.yml` | Add `pull_policy: never` to proxy services to ensure local images are used during integration tests. |
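The error formatting and mapping row above can be sketched as two small functions. The function names and message wording here are illustrative, not the proxy's actual API; only the millisecond unit and the 57P05 SQLSTATE come from the walkthrough:

```rust
use std::time::Duration;

// Hypothetical rendering of the ConnectionTimeout display change:
// the message reports the configured budget in milliseconds.
fn connection_timeout_message(timeout: Duration) -> String {
    format!("database connection timed out after {}ms", timeout.as_millis())
}

// SQLSTATE mapping per the walkthrough: 57P05 rather than the generic 58000.
fn connection_timeout_sqlstate() -> &'static str {
    "57P05"
}

fn main() {
    assert_eq!(
        connection_timeout_message(Duration::from_secs(120)),
        "database connection timed out after 120000ms"
    );
    assert_eq!(connection_timeout_sqlstate(), "57P05");
}
```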

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Proxy
    participant DB

    Client->>Proxy: Open connection / send startup
    Proxy->>DB: Connect / read (with connection timeout)
    alt DB responds before timeout
        DB->>Proxy: Response
        Proxy->>Client: Continue protocol
    else ConnectionTimeout
        DB--xProxy: no response (timeout)
        Proxy->>Proxy: Build 57P05 ErrorResponse (msg in ms)
        Proxy->>Client: Send connection timeout ErrorResponse
        Proxy->>Proxy: Flush and teardown connection
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • coderdan
  • auxesis

Poem

🐰 I hopped through code to mend the flow,
Timers set so timeouts show,
Milliseconds now sing the cue,
Local builds keep tests true.
A cheerful rabbit hops—resilience grew!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Connection error handling' is partially related to the changeset. While error handling is a component of the changes (particularly in error.rs, error_handler.rs, and handler.rs), the PR encompasses significantly broader modifications including connection timeout configuration, resilience testing, Docker image tagging, build workflow documentation, and macOS troubleshooting notes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs`:
- Around line 145-168: The readiness Notify (a_ready/a_ready_tx) can fire before
client_b is actually blocked, making the test flaky; replace the notify_one() +
fixed sleep logic with a deterministic wait that polls Postgres (e.g. pg_locks
or pg_stat_activity) on the direct connection until the proxy connection is
observed waiting for the advisory lock. Concretely: stop using
a_ready/a_ready_tx as the sole signal; after spawning b_handle (which uses
connect_with_tls(PROXY) and executes b_lock_query), poll from the direct
Postgres client (the direct connection used earlier) by running a query against
pg_locks/pg_stat_activity and check for client_b’s PID/status showing it is
waiting for the advisory lock, and only then proceed to measure connection C;
ensure the polling loop has a sensible timeout and fails the test if the wait
state never appears.
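The deterministic wait the comment asks for amounts to a poll-with-deadline loop. A generic sketch follows; the readiness check is abstracted as a closure so the example is self-contained, whereas in the real test it would run a `pg_locks`/`pg_stat_activity` query (shape illustrative) against the direct Postgres connection:

```rust
use std::time::{Duration, Instant};

// Poll an observable condition until it holds or the deadline passes.
// In the integration test, `check` would execute something like
//   SELECT 1 FROM pg_locks WHERE locktype = 'advisory' AND NOT granted
// and return whether a waiter was observed.
fn poll_until(mut check: impl FnMut() -> bool, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false; // caller should fail the test here
        }
        std::thread::sleep(interval);
    }
}

fn main() {
    let mut attempts = 0;
    // Condition becomes true on the third poll.
    let ready = poll_until(
        || {
            attempts += 1;
            attempts >= 3
        },
        Duration::from_secs(1),
        Duration::from_millis(1),
    );
    assert!(ready);
    assert_eq!(attempts, 3);
}
```

Returning `false` on deadline (rather than hanging) is what lets the test fail loudly "if the wait state never appears", as the comment requires.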

In `@packages/cipherstash-proxy/src/postgresql/handler.rs`:
- Around line 262-276: The ConnectionTimeout path must emit the ErrorResponse
not only after tokio::try_join! but also for early returns before the stream
split (the timeout occurrences at the earlier startup/auth branches referenced
in the review). Add a small helper (e.g.,
emit_connection_timeout(timeout_sender, err)) that builds
ErrorResponse::connection_timeout(err.to_string()), attempts
BytesMut::try_from(...), sends via timeout_sender, and does a
tokio::task::yield_now().await; then call that helper in the pre-split timeout
return sites (the branches that currently return Error::ConnectionTimeout around
the startup/password-exchange logic) instead of returning immediately, or move
the timeout-emission into a centralized early-return path so all
Error::ConnectionTimeouts use the same emission logic.
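The shape of the suggested helper can be sketched with standard-library channels standing in for the proxy's actual sender type. Everything here is illustrative: the real code would build an `ErrorResponse`, convert via `BytesMut::try_from`, send on the tokio channel, and `yield_now().await`, none of which appear in this std-only sketch:

```rust
use std::sync::mpsc;

// Best-effort emission of a connection-timeout error frame.
// The payload format is a stand-in, not the real PostgreSQL wire encoding.
fn emit_connection_timeout(sender: &mpsc::Sender<Vec<u8>>, err_message: &str) {
    let payload = format!("SERROR C57P05 M{err_message}").into_bytes();
    // Ignore send failure: if the channel is closed the client is already gone.
    let _ = sender.send(payload);
}

fn main() {
    let (tx, rx) = mpsc::channel();
    emit_connection_timeout(&tx, "connection timed out after 120000ms");
    let text = String::from_utf8(rx.recv().unwrap()).unwrap();
    assert!(text.contains("57P05"));
    assert!(text.contains("120000ms"));
}
```

The key design point the review raises is centralization: every early-return site that produces `Error::ConnectionTimeout` should route through one helper like this, so pre-split and post-split timeouts behave identically.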

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8410d5b-183a-44fe-9ca1-63f8f6ea7dda

📥 Commits

Reviewing files that changed from the base of the PR and between 4dff16e and 806eb59.

📒 Files selected for processing (11)
  • DEVELOPMENT.md
  • mise.toml
  • packages/cipherstash-proxy-integration/src/common.rs
  • packages/cipherstash-proxy-integration/src/connection_resilience.rs
  • packages/cipherstash-proxy-integration/src/lib.rs
  • packages/cipherstash-proxy/src/config/database.rs
  • packages/cipherstash-proxy/src/error.rs
  • packages/cipherstash-proxy/src/postgresql/error_handler.rs
  • packages/cipherstash-proxy/src/postgresql/handler.rs
  • packages/cipherstash-proxy/src/postgresql/messages/error_response.rs
  • tests/docker-compose.yml

metrics 0.24.1 has a lifetime issue (rust-lang/rust#141402) that causes
compilation failures on recent Rust versions.
@tobyhede tobyhede force-pushed the connection-error-handling branch from 468f4f2 to eec033f on March 10, 2026 at 03:19
Replace Notify + fixed sleep in advisory lock test with pg_locks polling
for deterministic synchronization. Send ErrorResponse to clients for
pre-split connection timeouts instead of silent disconnect.

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
packages/cipherstash-proxy-integration/src/connection_resilience.rs (2)

16-18: ⚠️ Potential issue | 🟠 Major

Tie the advisory-lock poll to connection B, not just the lock ID.

Right now poll_query only proves that some backend is waiting on ADVISORY_LOCK_ID. Because the key is hard-coded at Line 18 and client_b is created inside the task, another concurrent suite can satisfy this poll before this test's session is actually blocked. Create client_b before tokio::spawn, capture its backend PID, and filter pg_locks by that PID (or generate a per-test lock key) so the readiness gate stays specific to this test.

In PostgreSQL, can `pg_locks` be filtered by backend `pid` to identify the session waiting on a specific advisory lock, and how are `pg_advisory_lock(bigint)` keys represented in `pg_locks`?

Also applies to: 141-158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs` around
lines 16 - 18, The readiness poll currently checks pg_locks only by the
hard-coded ADVISORY_LOCK_ID (const ADVISORY_LOCK_ID) which can match a different
test's session; create the client used to acquire the lock (client_b) before
calling tokio::spawn so you can obtain its backend PID, then update poll_query
to filter pg_locks by that PID (or alternatively generate a unique per-test
ADVISORY_LOCK_ID) so the poll verifies that the specific client_b session is
blocked on pg_advisory_lock rather than any session using the same numeric key;
adjust the code paths that call poll_query (and the spawned task that runs the
lock) to use the captured PID or unique key accordingly.
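Filtering `pg_locks` by PID as suggested relies on how PostgreSQL represents a `pg_advisory_lock(bigint)` key: the 64-bit key is split across two 32-bit columns, with `classid` holding the high half, `objid` the low half, and `objsubid` set to 1. A small helper can compute those halves; the SQL shape in the comment is illustrative of a PID-specific readiness query, not code from this PR:

```rust
// For pg_advisory_lock(bigint), a PID-specific readiness query could look like:
//   SELECT 1 FROM pg_locks
//   WHERE locktype = 'advisory' AND objsubid = 1
//     AND classid = $1 AND objid = $2 AND pid = $3 AND NOT granted;
// where $1/$2 are the key halves computed below and $3 is client_b's backend PID.
fn advisory_key_halves(key: i64) -> (u32, u32) {
    let bits = key as u64;
    ((bits >> 32) as u32, (bits & 0xFFFF_FFFF) as u32)
}

fn main() {
    // A small key like 12345 lives entirely in the low half.
    assert_eq!(advisory_key_halves(12345), (0, 12345));
    // A key with high bits set splits across both columns.
    assert_eq!(advisory_key_halves((7i64 << 32) | 42), (7, 42));
}
```

Combining the key halves with the captured backend PID is what makes the readiness gate specific to this test's session, closing the cross-suite race the comment describes.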

32-33: ⚠️ Potential issue | 🟠 Major

Replace these sleep-based readiness gates with deterministic checks.

Line 33, Line 66, and Line 93 are still using fixed delays to guess when the previous state change has happened. That means these tests can pass before the slow workload is actually active, or only fail because teardown/startup took longer on CI. Please gate these assertions on an observable condition instead, similar to the pg_locks polling used later in this file, or remove the grace period entirely in the disconnect test so it exercises the immediate reconnect path.

Also applies to: 65-66, 92-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs` around
lines 32 - 33, The fixed-duration
tokio::time::sleep(Duration::from_millis(200)).await calls are nondeterministic
readiness gates; replace them with deterministic polling of an observable
condition (e.g., query the same pg_locks check used later in this file or poll
the proxy/connection state) so the test waits until the expected lock/worker
state is actually present before proceeding; alternatively, for the disconnect
test remove the grace period entirely and assert the immediate reconnect path by
checking the connection became active/inactive via the observable (use the same
pg_locks or connection-status query code path already in this file rather than
sleeping).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs`:
- Around line 16-18: The readiness poll currently checks pg_locks only by the
hard-coded ADVISORY_LOCK_ID (const ADVISORY_LOCK_ID) which can match a different
test's session; create the client used to acquire the lock (client_b) before
calling tokio::spawn so you can obtain its backend PID, then update poll_query
to filter pg_locks by that PID (or alternatively generate a unique per-test
ADVISORY_LOCK_ID) so the poll verifies that the specific client_b session is
blocked on pg_advisory_lock rather than any session using the same numeric key;
adjust the code paths that call poll_query (and the spawned task that runs the
lock) to use the captured PID or unique key accordingly.
- Around line 32-33: The fixed-duration
tokio::time::sleep(Duration::from_millis(200)).await calls are nondeterministic
readiness gates; replace them with deterministic polling of an observable
condition (e.g., query the same pg_locks check used later in this file or poll
the proxy/connection state) so the test waits until the expected lock/worker
state is actually present before proceeding; alternatively, for the disconnect
test remove the grace period entirely and assert the immediate reconnect path by
checking the connection became active/inactive via the observable (use the same
pg_locks or connection-status query code path already in this file rather than
sleeping).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b6eefc7-81f1-422b-bca9-856c9391b822

📥 Commits

Reviewing files that changed from the base of the PR and between 806eb59 and 5e936ad.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • packages/cipherstash-proxy-integration/src/connection_resilience.rs
  • packages/cipherstash-proxy/Cargo.toml
  • packages/cipherstash-proxy/src/config/database.rs
  • packages/cipherstash-proxy/src/postgresql/error_handler.rs
  • packages/cipherstash-proxy/src/postgresql/handler.rs

Contributor

@freshtonic freshtonic left a comment


Looks good but the tests are failing. I've added a comment with my hypothesis as to why.

```rust
client_b
    .simple_query(&b_lock_query)
    .await
    .unwrap();
```
Contributor


This test is failing, and I think it's because there should be a sleep between acquiring and releasing the lock, so that the has_waiting check on line 160 has a chance of observing client_b holding it.

client_a was hardcoded to PG_PORT (5532) while the proxy may connect to
a different PostgreSQL instance (e.g. postgres-tls:5617 in TLS CI). This
meant the advisory lock was held on a different server than client_b was
targeting, so the lock never blocked and the test was flaky in CI.

Use get_database_port() to ensure client_a connects to the same
PostgreSQL instance the proxy uses.

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
CLAUDE.md (1)

59-63: Consider enhancing the note with additional context.

The current guidance is helpful and actionable. To make it more comprehensive, consider these optional enhancements:

  1. Explain the context: Mention that this is particularly relevant when running integration tests with many concurrent connections
  2. Shell inclusivity: Reference both Zsh (.zshrc) and Bash (.bash_profile or .bashrc) users
  3. macOS system limits: Note that on some macOS versions, the system-wide hard limit might also need adjustment (e.g., via launchctl limit maxfiles)
📝 Optional enhanced version
-> **macOS Note:** If you hit file descriptor limits during development (e.g. "Too many open files"), you may need to increase the limit:
+> **macOS Note:** If you hit file descriptor limits during development (e.g., "Too many open files" when running integration tests with many concurrent connections), you may need to increase the limit:
 > ```bash
 > ulimit -n 10240
 > ```
-> To make this persistent, add it to your shell profile (e.g. `~/.zshrc`).
+> To make this persistent, add it to your shell profile (e.g., `~/.zshrc` for Zsh or `~/.bash_profile` for Bash). On some macOS versions, you may also need to adjust system-wide limits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 59 - 63, Update the macOS note around the "ulimit -n
10240" guidance: add a brief sentence that this is most relevant when running
integration tests or workloads with many concurrent connections, mention
persistence instructions for both Zsh and Bash (e.g., "~/.zshrc" and
"~/.bash_profile" or "~/.bashrc"), and add a short note that some macOS versions
may also require adjusting system-wide hard limits (e.g., via "launchctl limit
maxfiles") so users know to check both per-shell and system limits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@CLAUDE.md`:
- Around line 59-63: Update the macOS note around the "ulimit -n 10240"
guidance: add a brief sentence that this is most relevant when running
integration tests or workloads with many concurrent connections, mention
persistence instructions for both Zsh and Bash (e.g., "~/.zshrc" and
"~/.bash_profile" or "~/.bashrc"), and add a short note that some macOS versions
may also require adjusting system-wide hard limits (e.g., via "launchctl limit
maxfiles") so users know to check both per-shell and system limits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5dd5496e-25bb-475a-8079-3318c0e80ee3

📥 Commits

Reviewing files that changed from the base of the PR and between fc8544d and 7e8331d.

📒 Files selected for processing (1)
  • CLAUDE.md
