Skip to content

Fix crash due to asio object lifetime and thread safety issue#551

Open
BewareMyPower wants to merge 3 commits intoapache:mainfrom
BewareMyPower:bewaremypower/fix-connect-crash
Open

Fix crash due to asio object lifetime and thread safety issue#551
BewareMyPower wants to merge 3 commits intoapache:mainfrom
BewareMyPower:bewaremypower/fix-connect-crash

Conversation

@BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Mar 9, 2026

Fixes: #550

Motivation

ClientTest.testCloseClient is flaky that sometimes it could cause segmentation fault. This is because client.close() could be called quickly after createProducerAsync. In this case, the async_connect operation might not complete. In the close method, socket_->close() is called, which could complete the async_connect with operation_aborted. In asio, when the completion handler (callback) is called while the caller (socket) is destroyed, it will crash because the callback refers the caller internally.

There are several issues.

Issue 1: Thread safety issue when accessing asio objects (e.g. socket)

ip::tcp::socket::close is called by ClientConnection::close, which could be called in a separated thread created by ClientImpl::handleClose or directly by ClientImpl::shutdown. However, it's unsafe to access socket concurrently, see https://think-async.com/Asio/asio-1.36.0/doc/asio/reference/ip__tcp/socket.html

Shared objects: Unsafe.

Synchronous send, receive, connect, and shutdown operations are thread safe with respect to each other, if the underlying operating system calls are also thread safe. This means that it is permitted to perform concurrent calls to these synchronous operations on a single socket object. Other synchronous operations, such as open or close, are not thread safe.

ssl::stream has the same issue, that could be why the operations on tlsSocket_ were bound to strand_ before. Actually, it's also unsafe to call async_write directly on socket_ because it could be called by ProducerImpl::sendMessage in a different thread from the event loop thread.

Issue 2: The lifetime of asio objects might be shorter than the callbacks

#317 might not fix the null access issue correctly. It introduced several crash issues (#325, #326) by capturing a weak_ptr. The issue is that ConnectionPool::close don't wait until all pending asynchronous operations are done. Then, after the ClientConnection object is destroyed, the asio objects are also destroyed. However, the io_context could still try accessing them when destroying the internal callback queue.

Modifications

  • Capture shared_from_this() in all callbacks of asynchronous operations.
  • In close, dispatch the socket::close() call in event loop thread. For ssl::stream object, call async_shutdown. The previous close on lower_layer is actually equivalent with closing the socket.
  • In ConnectionPool::close, wait until all pending operations are done by checking the reference count of ClientConnection
  • Since PeriodicalTask holds a reference to connection via the callback, destroying it in close
  • Fix some tests failing in CI

@BewareMyPower BewareMyPower added this to the 4.1.0 milestone Mar 9, 2026
@BewareMyPower BewareMyPower self-assigned this Mar 9, 2026
@BewareMyPower BewareMyPower added the bug Something isn't working label Mar 9, 2026
@BewareMyPower BewareMyPower marked this pull request as draft March 9, 2026 03:58
@BewareMyPower BewareMyPower changed the title Fix: keep socket alive during async_connect to prevent use-after-free crash Fix crash from asio completion handler during Client::close with pending operations Mar 9, 2026
@BewareMyPower BewareMyPower changed the title Fix crash from asio completion handler during Client::close with pending operations Fix crash from asio callback during Client::close with pending operations Mar 9, 2026
@BewareMyPower BewareMyPower marked this pull request as ready for review March 9, 2026 13:11
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a crash caused by ASIO completion handlers running during/after Client::close() when there are still in-flight network operations (notably pending async_connect), which can lead to use-after-free of socket / io_context internals during shutdown.

Changes:

  • Adds tracking of in-flight (non-timer) async operations in ClientConnection and exposes a pendingOperations() counter.
  • Adjusts connection shutdown behavior to avoid closing the socket while the TCP connect is still pending, deferring close handling to handleTcpConnected.
  • Makes ConnectionPool::close() wait (up to 5s) for each connection’s pending operations to complete before clearing the pool.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/PulsarFriend.h Uses C++17 lock-guard deduction when locking ClientConnection internals in tests.
lib/MockServer.h Updates mutex locking to use C++17 lock-guard deduction for consistency.
lib/ConnectionPool.cc Waits for pending async ops per connection during pool shutdown to reduce shutdown-time crashes.
lib/ClientConnection.h Introduces pendingOperations_ counter, exposes accessor, and changes mutex type to recursive_mutex.
lib/ClientConnection.cc Instruments connect/handshake/read/write paths for pending-op tracking and modifies shutdown/connect completion handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines 1102 to 1109
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this path, pendingOperations_ will not be increased when sendCommand returns. It is only increased in sendCommandInternal after this callback is executed.

This means the issue that existed before this PR can still happen: when the connection is being closed, there might still be an in-flight callback that hasn't been counted yet.

@BewareMyPower BewareMyPower marked this pull request as draft March 11, 2026 04:08
@BewareMyPower BewareMyPower changed the title Fix crash from asio callback during Client::close with pending operations Fix crash due to asio object lifetime and thread safety issue Mar 11, 2026
@BewareMyPower BewareMyPower force-pushed the bewaremypower/fix-connect-crash branch from 6e3901c to 94b70bf Compare March 11, 2026 13:46
@BewareMyPower BewareMyPower marked this pull request as ready for review March 11, 2026 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Client.close might crash

4 participants