perf: executePlan uses a channel to park executor task thread instead of yield_now() [iceberg] #3553
Merged: mbutrovich merged 4 commits into apache:main (Feb 20, 2026)
Conversation
Contributor (Author): I updated the development guide with design considerations for thread-local data and JNI in this architecture. I will try to get more benchmarking results today.
Which issue does this PR close?
N/A.
Rationale for this change
I am observing heavy CPU utilization on I/O- and latency-bound workloads (e.g., tens of thousands of small Iceberg FileScanTasks on an object store). It looks like a busy-poll, but we've already done work to address that (#2937, #2938, #3063). Those changes eliminated some sources of tokio overhead, yet we still see high CPU utilization on workloads that shouldn't have it. So I went looking for anything that could lead to busy-poll-like behavior.
In `executePlan`, the task executor thread does a `block_on` on the stream; the stream is Pending, and then we yield. However, if the I/O tasks aren't done, we just wake up again, check the stream, find it still Pending, and yield again. This essentially degrades to a busy-poll and results in a ton of scheduling overhead in tokio.

Tokio's docs for `block_on` (https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#method.block_on) note the challenges of mixing `block_on` behavior with other futures. Combined with `yield_now` (which re-enqueues immediately rather than waiting for a waker), the executor thread spins checking for data that isn't ready yet instead of parking until an I/O completion wakes it (which is what I intended in #3063).
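To make the failure mode concrete, here is a minimal sketch of the loop described above, with hypothetical names (this is not the actual code in `jni_api.rs`). The waker registered when we poll the stream is tied to the `block_on` call and dies when it returns, so the only way to make progress is to poll again, and `yield_now` turns that into a spin:

```rust
use futures::Stream;
use std::pin::Pin;
use std::task::Poll;

// Hypothetical sketch of the busy-poll; `T` stands in for a record batch.
fn next_item<T>(
    runtime: &tokio::runtime::Runtime,
    stream: &mut Pin<Box<dyn Stream<Item = T> + Send>>,
) -> Option<T> {
    loop {
        // Poll the stream exactly once on this thread. The waker handed to
        // the stream belongs to this block_on call and is gone once it
        // returns, so no I/O completion can wake us through it.
        let polled = runtime.block_on(futures::future::poll_fn(|cx| {
            Poll::Ready(stream.as_mut().poll_next(cx))
        }));
        match polled {
            Poll::Ready(item) => return item,
            Poll::Pending => {
                // yield_now() re-enqueues immediately rather than parking
                // until a waker fires, so the loop spins:
                // poll -> Pending -> yield -> poll -> Pending -> ...
                runtime.block_on(tokio::task::yield_now());
            }
        }
    }
}
```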
What changes are included in this PR?

For scenarios where we don't have any scans that need to pull batches from the JVM, we set up a channel and pass the stream-execution task into the tokio worker pool. This lets the tokio worker pool handle stream execution and allows the executor task thread from the JVM (the one that made the call into `executePlan` in `jni_api.rs`) to properly block until a batch arrives.
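A minimal sketch of the new shape, assuming a bounded `tokio::sync::mpsc` channel (names are again hypothetical, not the actual code): the worker pool drives the stream, where I/O completions wake the task through the normal waker plumbing, and the calling thread parks in `blocking_recv` until a batch actually arrives:

```rust
use futures::{Stream, StreamExt};
use std::pin::Pin;

// Hypothetical sketch: drive the stream on tokio's worker pool and hand
// items back over a channel; `T` stands in for a record batch.
fn run_on_pool<T: Send + 'static>(
    runtime: &tokio::runtime::Runtime,
    mut stream: Pin<Box<dyn Stream<Item = T> + Send>>,
) -> tokio::sync::mpsc::Receiver<T> {
    let (tx, rx) = tokio::sync::mpsc::channel(1);
    runtime.spawn(async move {
        while let Some(item) = stream.next().await {
            if tx.send(item).await.is_err() {
                break; // receiver dropped, e.g. the query was cancelled
            }
        }
        // Dropping tx closes the channel, which ends the consumer loop.
    });
    rx
}

// On the JVM executor task thread (the executePlan caller):
//
//     let mut rx = run_on_pool(&runtime, stream);
//     while let Some(batch) = rx.blocking_recv() {
//         // hand `batch` back across JNI
//     }
//
// blocking_recv() parks the thread until the pool task sends a batch or
// drops the sender, instead of spinning on yield_now().
```

The capacity of 1 in this sketch is an arbitrary choice that also provides backpressure: the pool task can run at most one batch ahead of the JNI consumer.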
How are these changes tested?

Existing tests. I also benchmarked before and after with a workload that (1) creates an Iceberg table in MinIO (S3) with 10,000 small data files, and (2) runs a query that reads all of the data. I couldn't simulate the latency a real object store adds, but you can see the difference:
main

(screenshot: thread timeline and CPU usage chart for main)

CPU usage hovers around 30%, the executor task threads are almost solid green, "working" by spinning, and the average query execution time across the 10 iterations shown in the chart was 5627 ms.
This PR

(screenshot: thread timeline and CPU usage chart for this PR)

CPU usage hovers around 8%, the executor task thread shows much more red (parked), and the average query execution time across the 10 iterations shown in the chart was 5201 ms.