perf: executePlan uses a channel to park executor task thread instead of yield_now() [iceberg] #3553
Merged: mbutrovich merged 4 commits into apache:main (Feb 20, 2026)
Conversation
Contributor (Author): I updated the development guide with design considerations for thread-local data and JNI in this architecture. I will try to get more benchmarking results today.
Which issue does this PR close?
N/A.
Rationale for this change
I am observing heavy CPU utilization on I/O- and latency-bound workloads (e.g., tens of thousands of small Iceberg FileScanTasks on an object store). It looks like a busy-poll, but we've already done work to address that (#2937, #2938, #3063). Those changes eliminated some sources of tokio overhead, yet we still see high CPU utilization on workloads that shouldn't have it. So I went looking for anything that could lead to busy-poll-like behavior.
In `executePlan`, the task executor thread does a `block_on` on the stream; the stream is Pending, and then we yield. However, if the I/O tasks aren't done, we just wake up again, check the stream, find it still Pending, and yield again. This essentially degrades to a busy-poll and results in a ton of scheduling overhead in tokio.

Tokio's docs for `block_on` (https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#method.block_on) note the challenges of mixing `block_on` behavior with other futures. Combined with `yield_now` (which re-enqueues immediately rather than waiting for a waker), the executor thread spins checking for data that isn't ready yet instead of parking until an I/O completion wakes it (which is what I intended in #3063).
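To make the failure mode concrete, here is a minimal sketch of the loop described above, with hypothetical names (this is not the actual code in `jni_api.rs`). The waker registered when we poll the stream is tied to the `block_on` call and dies when it returns, so the only way to make progress is to poll again, and `yield_now` turns that into a spin:

```rust
use futures::Stream;
use std::pin::Pin;
use std::task::Poll;

// Hypothetical sketch of the busy-poll; `T` stands in for a record batch.
fn next_item<T>(
    runtime: &tokio::runtime::Runtime,
    stream: &mut Pin<Box<dyn Stream<Item = T> + Send>>,
) -> Option<T> {
    loop {
        // Poll the stream exactly once on this thread. The waker handed to
        // the stream belongs to this block_on call and is gone once it
        // returns, so no I/O completion can wake us through it.
        let polled = runtime.block_on(futures::future::poll_fn(|cx| {
            Poll::Ready(stream.as_mut().poll_next(cx))
        }));
        match polled {
            Poll::Ready(item) => return item,
            Poll::Pending => {
                // yield_now() re-enqueues immediately rather than parking
                // until a waker fires, so the loop spins:
                // poll -> Pending -> yield -> poll -> Pending -> ...
                runtime.block_on(tokio::task::yield_now());
            }
        }
    }
}
```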
What changes are included in this PR?

For scenarios where we don't have any scans that need to pull batches from the JVM, we set up a channel and pass the stream-execution task into the tokio worker pool. This lets the tokio worker pool handle stream execution and allows the executor task thread from the JVM (the one that made the call into `executePlan` in `jni_api.rs`) to properly block until a batch arrives.
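A minimal sketch of the new shape, assuming a bounded `tokio::sync::mpsc` channel (names are again hypothetical, not the actual code): the worker pool drives the stream, where I/O completions wake the task through the normal waker plumbing, and the calling thread parks in `blocking_recv` until a batch actually arrives:

```rust
use futures::{Stream, StreamExt};
use std::pin::Pin;

// Hypothetical sketch: drive the stream on tokio's worker pool and hand
// items back over a channel; `T` stands in for a record batch.
fn run_on_pool<T: Send + 'static>(
    runtime: &tokio::runtime::Runtime,
    mut stream: Pin<Box<dyn Stream<Item = T> + Send>>,
) -> tokio::sync::mpsc::Receiver<T> {
    let (tx, rx) = tokio::sync::mpsc::channel(1);
    runtime.spawn(async move {
        while let Some(item) = stream.next().await {
            if tx.send(item).await.is_err() {
                break; // receiver dropped, e.g. the query was cancelled
            }
        }
        // Dropping tx closes the channel, which ends the consumer loop.
    });
    rx
}

// On the JVM executor task thread (the executePlan caller):
//
//     let mut rx = run_on_pool(&runtime, stream);
//     while let Some(batch) = rx.blocking_recv() {
//         // hand `batch` back across JNI
//     }
//
// blocking_recv() parks the thread until the pool task sends a batch or
// drops the sender, instead of spinning on yield_now().
```

The capacity of 1 in this sketch is an arbitrary choice that also provides backpressure: the pool task can run at most one batch ahead of the JNI consumer.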
How are these changes tested?

Existing tests. I also benchmarked before and after with a workload that (1) creates an Iceberg table in MinIO (S3) with 10,000 small data files, and (2) runs a query that reads all of the data. I couldn't simulate the latency a real object store adds, but you can see the difference:
main

(screenshot: thread timeline and CPU usage chart for main)

CPU usage hovers around 30%, the executor task threads are almost solid green, "working" by spinning, and the average query execution time across the 10 iterations shown in the chart was 5627 ms.
This PR

(screenshot: thread timeline and CPU usage chart for this PR)

CPU usage hovers around 8%, the executor task thread shows much more red (parked), and the average query execution time across the 10 iterations shown in the chart was 5201 ms.