[ET-VK][q8ta] Add q8ta_linear_gemv op for batch-1 int8 linear by SS-JIA · Pull Request #17566 · pytorch/executorch

SS-JIA · 2026-02-19T19:48:48Z

Stack from ghstack (oldest at bottom):

Add a cooperative GEMV variant of q8ta_linear optimized for batch size 1. The existing q8ta_linear uses a tiled algorithm with 4H4W packed int8 layout, which is inefficient for single-row inputs because it wastes 3/4 of each ivec4 block. The new q8ta_linear_gemv uses 4W packed int8 layout (scalar int[] buffers) and a cooperative algorithm where 64 threads split the K reduction dimension with shared memory tree reduction.
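To make the cooperative scheme above concrete, here is a minimal CPU-side sketch in Python of one output element being computed by 64 simulated threads. It is not the shader from this PR: the function name `cooperative_dot` is hypothetical, the strided split of K across threads is an assumption, and the sketch works on unpacked integers rather than packed int32 words.

```python
WORKGROUP_SIZE = 64  # thread count stated in the PR description


def cooperative_dot(x, w_row):
    """Simulate 64 threads splitting the K dimension, then a tree reduction.

    Hypothetical helper for illustration; the real op is a Vulkan compute shader.
    """
    K = len(x)
    # Phase 1: each simulated thread strides over K (assumed split) and keeps
    # a private partial sum, as a thread would in registers.
    partial = [0] * WORKGROUP_SIZE
    for tid in range(WORKGROUP_SIZE):
        for k in range(tid, K, WORKGROUP_SIZE):
            partial[tid] += x[k] * w_row[k]
    # Phase 2: shared-memory style tree reduction. The number of active
    # threads halves each step until slot 0 holds the full sum.
    stride = WORKGROUP_SIZE // 2
    while stride > 0:
        for tid in range(stride):
            partial[tid] += partial[tid + stride]
        stride //= 2
    return partial[0]
```

With 64 threads the reduction phase takes six halving steps, after which only slot 0 is read, which matches the description of thread 0 producing the final output.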

The shader loads one packed int32 (4 int8 values) per thread per K iteration and accumulates dot products against the weight tile using dotPacked4x8AccSatEXT. After reduction, thread 0 applies scales, zero points, bias, and quantizes the output.
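As a rough illustration of the packed accumulation step, the following Python sketch emulates a signed 4x8-bit packed dot product with a saturating 32-bit accumulator, which is the behavior of the signed overload of `dotPacked4x8AccSatEXT` as I understand it. The helper names (`pack4`, `unpack4`, `dot_packed_4x8_acc_sat`) are hypothetical, and the byte order inside the word is an assumption; the result does not depend on it as long as both operands are packed the same way.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def pack4(vals):
    """Pack four int8 values into one 32-bit word (assumed: lowest byte first)."""
    assert len(vals) == 4 and all(-128 <= v <= 127 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack4(word):
    """Recover the four signed int8 lanes from a packed 32-bit word."""
    lanes = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        lanes.append(byte - 256 if byte >= 128 else byte)
    return lanes


def dot_packed_4x8_acc_sat(a, b, acc):
    """Emulate acc + dot(a, b) over four int8 lanes, clamped to the int32 range."""
    total = acc + sum(x * y for x, y in zip(unpack4(a), unpack4(b)))
    return max(INT32_MIN, min(INT32_MAX, total))
```

For example, `dot_packed_4x8_acc_sat(pack4([1, -2, 3, -4]), pack4([5, 6, -7, 8]), 10)` adds the lane products 5, -12, -21 and -32 to the accumulator 10.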

The pattern matcher in quantized_linear.py selects q8ta_linear_gemv when the input batch dimension is 1, falling back to q8ta_linear for larger batches.
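The selection rule can be sketched as a small Python function. This is not the code in quantized_linear.py: `select_linear_op` is a hypothetical name, and treating "batch" as the product of all dimensions except the last is an assumption made for the sketch.

```python
def select_linear_op(input_shape):
    """Return the op name to use for a quantized linear with this input shape.

    Hypothetical helper; assumes the last dimension is K and everything
    before it counts toward the batch.
    """
    batch = 1
    for dim in input_shape[:-1]:
        batch *= dim
    # Batch 1 takes the cooperative GEMV path; anything larger falls back
    # to the tiled q8ta_linear.
    return "q8ta_linear_gemv" if batch == 1 else "q8ta_linear"
```

Under these assumptions, an input of shape `(1, 256)` selects `q8ta_linear_gemv`, while `(8, 256)` falls back to `q8ta_linear`.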

Also adds PACKED_INT8_4W (value 5) to the serialization schema to support the 4W memory layout in the export pipeline.

Authored with Claude.

Differential Revision: D93768643

pytorch-bot · 2026-02-19T19:48:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17566

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Unrelated Failure

As of commit 5c5dd83 with merge base 9a58ce8:

NEW FAILURES - The following jobs have failed:

pull / test-mediatek-models-linux / linux-job (gh)
RuntimeError: Command docker exec -t 2e173131eb706d5a1742d243bef482ab0ad194dad6c42050016d71eeaf66adab /exec failed with exit code 2
pull / test-openvino-linux / linux-job (gh)
RuntimeError: Command docker exec -t a50b1ff2e44009863d19dd0362877e0f6d66ed257fd0caab8df80060bf8e8784 /exec failed with exit code 1
pull / test-samsung-models-linux / linux-job (gh)
test_edsr_fp16
pull / unittest-buck / linux / linux-job (gh)
RuntimeError: Command docker exec -t 6bb83453aded1a5abe04680f6e2e64c76a9b15a01898ec885cfbf5c3e4a46a3a /exec failed with exit code 3

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest-buck / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 3

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-02-19T19:49:57Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-codesync · 2026-02-21T17:09:13Z

This pull request has been merged in 187dc1d.

meta-cla bot added the CLA Signed label Feb 19, 2026

This was referenced Feb 19, 2026

[ET-VK][q8ta] Fix addmm arg indexing in QuantizedLinearMatch #17567

Closed

[ET-VK][ez][qconv] Add auto-selection to prefer im2col for q8ta_conv2d #17568

Closed

meta-codesync bot added the fb-exported and meta-exported labels Feb 19, 2026

manuelcandales approved these changes Feb 19, 2026

ssjia added 6 commits February 20, 2026 15:58

SS-JIA closed this in 187dc1d Feb 21, 2026

SS-JIA had a problem deploying to cherry-pick-bot February 21, 2026 17:08 — with GitHub Actions Failure

facebook-github-bot added the Merged label Feb 21, 2026

SS-JIA wants to merge 7 commits into gh/SS-JIA/440/base from gh/SS-JIA/440/head
