[ET-VK][q8ta] Add q8ta_linear_gemv op for batch-1 int8 linear #17566
SS-JIA wants to merge 7 commits into gh/SS-JIA/440/base from gh/SS-JIA/440/head
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17566
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 1 Unrelated Failure
As of commit 5c5dd83 with merge base 9a58ce8:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following job failed but was already failing on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
This pull request has been merged in 187dc1d.
Stack from ghstack (oldest at bottom):
Add a cooperative GEMV variant of q8ta_linear optimized for batch size 1. The existing q8ta_linear uses a tiled algorithm with a 4H4W packed int8 layout, which is inefficient for single-row inputs because 3/4 of each ivec4 block is padding. The new q8ta_linear_gemv uses a 4W packed int8 layout (scalar int[] buffers) and a cooperative algorithm in which 64 threads split the K reduction dimension and combine partial sums with a shared-memory tree reduction.
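For intuition on the layout difference, here is a minimal Python sketch (not the actual ExecuTorch packing code; the helper names are made up) showing why 4H4W blocking utilizes only 1/4 of each block when the batch dimension is 1, while the 4W layout fully populates every packed int32 word:

```python
import numpy as np

def pack_4w(row: np.ndarray) -> np.ndarray:
    """4W layout: 4 consecutive int8 values per int32 word, fully used for any row."""
    assert row.dtype == np.int8 and row.size % 4 == 0
    return row.view(np.int32)

def utilization_4h4w(h: int, w: int) -> float:
    """Fraction of packed storage holding real data when H and W are padded to multiples of 4."""
    padded_h, padded_w = (h + 3) // 4 * 4, (w + 3) // 4 * 4
    return (h * w) / (padded_h * padded_w)

print(pack_4w(np.arange(8, dtype=np.int8)))  # 2 fully populated int32 words
print(utilization_4h4w(1, 4096))             # 0.25: 3/4 of every ivec4 block is padding
```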
The shader loads one packed int32 (4 int8 values) per thread per K iteration and accumulates dot products against the weight tile using dotPacked4x8AccSatEXT. After the reduction, thread 0 applies the scales, zero points, and bias, then quantizes the output.
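As a rough reference for the cooperative scheme, the following Python simulation reproduces the partial-accumulate-then-tree-reduce structure. It assumes a workgroup of 64 threads and a strided split of K; the real shader operates on packed int32 words via dotPacked4x8AccSatEXT, emulated here with a plain 4-element dot product:

```python
import numpy as np

NUM_THREADS = 64  # workgroup size taken from the description above

def gemv_cooperative(x_q: np.ndarray, w_q: np.ndarray) -> np.ndarray:
    """Simulate the cooperative GEMV: threads split K, then tree-reduce partial sums.
    x_q: (K,) int8 activations; w_q: (N, K) int8 weights; returns int accumulators."""
    n_words = x_q.size // 4  # one packed int32 word = 4 int8 values
    partial = np.zeros((NUM_THREADS, w_q.shape[0]), dtype=np.int64)
    for tid in range(NUM_THREADS):
        # each thread loads one packed word per iteration, striding by the workgroup size
        for word in range(tid, n_words, NUM_THREADS):
            k = word * 4
            # stand-in for dotPacked4x8AccSatEXT: 4-way int8 dot product, accumulated
            partial[tid] += x_q[k:k+4].astype(np.int64) @ w_q[:, k:k+4].astype(np.int64).T
    # shared-memory tree reduction: halve the active threads each step
    stride = NUM_THREADS // 2
    while stride > 0:
        partial[:stride] += partial[stride:2*stride]
        stride //= 2
    return partial[0]  # "thread 0" ends up holding the full dot products

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=256, dtype=np.int8)
w = rng.integers(-128, 128, size=(32, 256), dtype=np.int8)
assert np.array_equal(gemv_cooperative(x, w),
                      x.astype(np.int64) @ w.astype(np.int64).T)
```

The scale/zero-point/bias epilogue that thread 0 performs is omitted from the sketch for brevity.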
The pattern matcher in quantized_linear.py selects q8ta_linear_gemv when the input batch dimension is 1, falling back to q8ta_linear for larger batches.
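The dispatch presumably boils down to a batch-size check along these lines; this is a hypothetical sketch (the function name and the collapsing of leading dims into the batch are assumptions), not the actual matcher code:

```python
import math

def select_q8ta_linear_variant(input_shape: tuple[int, ...]) -> str:
    """Hypothetical helper: pick the GEMV shader only for an effective batch of 1."""
    batch = math.prod(input_shape[:-1])  # assume leading dims collapse into the batch
    return "q8ta_linear_gemv" if batch == 1 else "q8ta_linear"

assert select_q8ta_linear_variant((1, 4096)) == "q8ta_linear_gemv"
assert select_q8ta_linear_variant((8, 4096)) == "q8ta_linear"
```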
Also adds PACKED_INT8_4W (value 5) to the serialization schema to support the 4W memory layout in the export pipeline.
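For context, the new tag would sit in the schema's layout enum roughly as below; only the PACKED_INT8_4W = 5 entry is stated in this PR, and the enum name and the other members shown are assumptions for illustration:

```python
from enum import IntEnum

class VkMemoryLayout(IntEnum):
    """Hypothetical stand-in for the serialization schema's layout enum."""
    TENSOR_WIDTH_PACKED = 0     # assumed existing entries, shown for context
    TENSOR_HEIGHT_PACKED = 1
    TENSOR_CHANNELS_PACKED = 2
    PACKED_INT8_4W = 5          # new: 4 int8 values per int32 word in scalar int[] buffers
```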
Authored with Claude.
Differential Revision: D93768643