[ET-VK] Experimental FP Linear Implementation with NV cooperative matrix 2 extension by HarryHu-art · Pull Request #17581 · pytorch/executorch

HarryHu-art · 2026-02-20T06:00:52Z

Stack from ghstack (oldest at bottom):

Experimental FP Linear Implementation with NV cooperative matrix 2

 buck run  @fbcode/mode/win //xplat/executorch/backends/vulkan/test/custom_ops:test_fp_linear
>>
File changed: fbcode//executorch/backends/vulkan/runtime/gen_vulkan_spv.py
File changed: fbcode//executorch/backends/vulkan/runtime/graph/ops/glsl/pack_fp_linear_weight.yaml
File changed: fbcode//executorch/backends/vulkan/runtime/graph/ops/impl/LinearExperimental.cpp
15 additional file change events
Buck UI: https://www.internalfb.com/buck2/34f0710d-d349-4cba-9e35-10926968dd39
Network: Up: 0B  Down: 0B
Command: run.
Time elapsed: 19.0s
BUILD SUCCEEDED - starting your binary

=== Compute Shader Performance Benchmark ===
FP32/FP16 Linear Layer Benchmark
----------------------------------------------------------------------

=== Cooperative Matrix Properties ===
Loader Message 0 Inserted device layer "VK_LAYER_KHRONOS_validation" (C:\VulkanSDK\1.4.321.1\Bin\.\VkLayer_khronos_validation.dll)
Loader Message 0 Inserted device layer "VK_LAYER_NV_present" (C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll)
Loader Message 0 Inserted device layer "VK_LAYER_NV_optimus" (C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll)
Loader Message 0 vkCreateDevice layer callstack setup to:
Loader Message 0    <Application>
Loader Message 0      ||
Loader Message 0    <Loader>
Loader Message 0      ||
Loader Message 0    VK_LAYER_NV_optimus
Loader Message 0            Type: Implicit
Loader Message 0            Enabled By: Implicit Layer
Loader Message 0                Disable Env Var:  DISABLE_LAYER_NV_OPTIMUS_1
Loader Message 0            Manifest: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\nv-vk64.json
Loader Message 0            Library:  C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll
Loader Message 0      ||
Loader Message 0    VK_LAYER_NV_present
Loader Message 0            Type: Implicit
Loader Message 0            Enabled By: Implicit Layer
Loader Message 0                Disable Env Var:  DISABLE_LAYER_NV_PRESENT_1
Loader Message 0            Manifest: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\nv-vk64.json
Loader Message 0            Library:  C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll
Loader Message 0      ||
Loader Message 0    VK_LAYER_KHRONOS_validation
Loader Message 0            Type: Explicit
Loader Message 0            Enabled By: By the Application
Loader Message 0            Manifest: C:\VulkanSDK\1.4.321.1\Bin\VkLayer_khronos_validation.json
Loader Message 0            Library:  C:\VulkanSDK\1.4.321.1\Bin\.\VkLayer_khronos_validation.dll
Loader Message 0      ||
Loader Message 0    <Device>
Loader Message 0        Using "NVIDIA GeForce RTX 5080" with driver: "C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll"
Validation 0 vkCreateImage(): The following VkImageCreateInfo returned VK_ERROR_FORMAT_NOT_SUPPORTED when calling vkGetPhysicalDeviceImageFormatProperties2
format (VK_FORMAT_R32G32B32A32_SFLOAT)
type (VK_IMAGE_TYPE_3D)
tiling (VK_IMAGE_TILING_LINEAR)
usage (VK_IMAGE_USAGE_SAMPLED_BIT|VK_IMAGE_USAGE_STORAGE_BIT)
flags (VkImageCreateFlags(0))
VkImageCreateInfo::pNext is NULL.
The Vulkan spec states: Each of the following values (as described in Image Creation Limits) must not be undefined : imageCreateMaxMipLevels, imageCreateMaxArrayLayers, imageCreateMaxExtent, and imageCreateSampleCounts (https://vulkan.lunarg.com/doc/view/1.4.321.1/windows/antora/spec/latest/chapters/resources.html#VUID-VkImageCreateInfo-imageCreateMaxMipLevels-02251)
Found 15 cooperative matrix configurations:
----------------------------------------------------------------------
  #  |   M  |   N  |   K  | A Type  | B Type  | C Type  | R Type  | Scope
----------------------------------------------------------------------
   0 |   16 |   16 |   16 | float16 | float16 | float16 | float16 | Subgroup
   1 |   16 |    8 |   16 | float16 | float16 | float16 | float16 | Subgroup
   2 |   16 |    8 |    8 | float16 | float16 | float16 | float16 | Subgroup
   3 |   16 |   16 |   16 | float16 | float16 | float32 | float32 | Subgroup
   4 |   16 |    8 |   16 | float16 | float16 | float32 | float32 | Subgroup
   5 |   16 |    8 |    8 | float16 | float16 | float32 | float32 | Subgroup
   6 |   16 |   16 |   32 | uint8   | uint8   | uint32  | uint32  | Subgroup
   7 |   16 |   16 |   32 | int8    | int8    | int32   | int32   | Subgroup
   8 |   16 |    8 |   32 | uint8   | uint8   | uint32  | uint32  | Subgroup
   9 |   16 |    8 |   32 | int8    | int8    | int32   | int32   | Subgroup
  10 |   16 |   16 |   16 | unknown | unknown | float32 | float32 | Subgroup
  11 |   16 |   16 |   32 | unknown | unknown | float16 | float16 | Subgroup
  12 |   16 |   16 |   32 | unknown | unknown | float32 | float32 | Subgroup
  13 |   16 |   16 |   32 | unknown | unknown | float16 | float16 | Subgroup
  14 |   16 |   16 |   32 | unknown | unknown | float32 | float32 | Subgroup
----------------------------------------------------------------------

Configurations with float32 A, B, C types:

Configurations with float16 A/B, float32 C (mixed precision):
  M=16, N=16, K=16, Scope=Subgroup
  M=16, N=8, K=16, Scope=Subgroup
  M=16, N=8, K=8, Scope=Subgroup

Test: ACCU  B=4  I=128  O=128  Buf  fp16+bias L

input_tensor Data:
  Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
  Total elements: 512
  Data (first 64 elements): [-0.250732, 0.592773, 0.901367, -0.632812, 0.463867, 0.559082, 0.197266, 0.193604, -0.687500, -0.108276, -0.687988, -0.799805, -0.883789, -0.081482, 0.731934, -0.332520, 0.202148, -0.713867, 0.416016, 0.301758, -0.958496, -0.886719, 0.939453, 0.443848, 0.664551, 0.876953, -0.575195, -0.998047, -0.636230, 0.984375, -0.632812, 0.234863, -0.391357, 0.223267, 0.049500, -0.985840, -0.136108, -0.953613, -0.417480, 0.049530, 0.223633, -0.200195, -0.720703, -0.906250, -0.415527, 0.947266, -0.267090, -0.534180, -0.087830, -0.818359, 0.570312, 0.236694, -0.600586, -0.234985, 0.028458, 0.966309, 0.184814, -0.066467, -0.906738, 0.719727, 0.215088, 0.360596, -0.658691, -0.098999, ... (448 more)]
  Statistics: min=0.229682, max=1.468703, mean=0.922048, sum=472.088684

weight_tensor Data:
  Type: ValueSpec(type=Tensor, sizes=[128, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
  Total elements: 16384
  Data (first 64 elements): [-0.769531, -0.006275, 0.218018, -0.793945, -0.732910, -0.702148, -0.518555, -0.655273, -0.345703, 0.622070, 0.718262, -0.850098, 0.332031, -0.909668, 0.082275, 0.224243, -0.941895, 0.189575, 0.467285, -0.498535, -0.210083, -0.705078, 0.604004, -0.977051, -0.490967, -0.063721, -0.885742, 0.909180, 0.732910, 0.111511, -0.557617, 0.299316, -0.189941, 0.161743, -0.367676, -0.662109, -0.846191, -0.168213, 0.686035, 0.305908, 0.697754, 0.241089, 0.942871, -0.192383, -0.229126, 0.747070, 0.908691, 0.110107, -0.108459, -0.142090, 0.339355, -0.720215, -0.834961, -0.539062, 0.793945, -0.605469, -0.403809, 0.596680, -0.475342, -0.335205, -0.989258, -0.157715, 0.086365, -0.110352, ... (16320 more)]
  Statistics: min=0.013550, max=1.468764, mean=0.913034, sum=14959.146484

bias_tensor Data:
  Type: ValueSpec(type=Tensor, sizes=[128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
  Total elements: 128
  Data (first 64 elements): [0.669434, -0.134888, -0.790039, -0.936035, 0.489258, 0.255615, -0.278809, -0.630859, -0.281250, -0.407227, 0.218384, -0.485840, -0.212402, 0.359619, -0.181763, -0.434814, 0.019791, 0.833496, 0.420166, -0.583496, 0.920898, 0.262939, -0.086731, 0.597168, -0.144653, 0.065247, -0.772949, -0.650391, -0.563965, -0.645020, 0.914551, 0.467285, 0.886230, -0.200317, 0.763184, 0.024109, 0.292725, -0.917969, -0.572266, -0.051605, 0.273438, -0.978027, -0.721680, 0.602051, -0.082581, -0.718262, 0.747559, -0.410889, -0.482910, 0.708496, 0.329590, 0.599609, 0.725098, -0.226318, -0.702148, 0.677734, 0.125854, -0.276367, -0.681641, 0.816406, -0.653809, -0.542480, -0.791504, -0.032104, ... (64 more)]
  Statistics: min=0.289590, max=1.467422, mean=0.940261, sum=120.353394
Executing 1 test cases for FPLinear
----------------------------------------------------------------------
==================== Shared Object List ====================
   idx               sizes                   users
     0                  16                    [7,]
==================== Value List ============================
   idx      type               sizes node_type  storage_bytes    so_idx
     0    TENSOR            [4, 128]     INPUT           1024
     1   STAGING
     2 TENSORREF          [128, 128]   PREPACK
     3 TENSORREF               [128]   PREPACK
     4       INT
     5    TENSOR            [4, 128]    OUTPUT           1024
     6    TENSOR          [128, 128]   PREPACK          32768
     7    TENSOR                  []                       16         0
     8    TENSOR               [128]   PREPACK            256
     9   STAGING
==================== Prepack Node List =====================
   idx                     shader_name    tref  packed
     0pack_fp_linear_weight_buffer_half       2       6
     1        nchw_to_buffer_half_half       3       8
==================== Execute Node List =====================
   idx                     shader_name                READ_arg               WRITE_arg
     0       nchw_to_buffer_half_float                    [1,]                    [0,]
     1        linear_tiled_nv_cm2_half                [0,6,8,]                    [5,]
     2       buffer_to_nchw_half_float                    [5,]                    [9,]

Output[0] Data:
  Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=ZEROS)
  Total elements: 512
  Data (first 20 elements): [-1.498047, -3.060547, -0.001218, -2.798828, 1.934570, 2.558594, 3.414062, 5.820312, -2.109375, -3.806641, 3.628906, 0.022491, -1.992188, -1.054688, 0.149658, -8.593750, -0.514648, 8.828125, 3.468750, -1.464844, ... (492 more)]
  Statistics: min=0.101169, max=1.581347, mean=1.030442, sum=527.586243

Output[0] (ref) Data:
  Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=ZEROS)
  Total elements: 512
  Data (first 20 elements): [-1.499553, -3.061874, -0.002150, -2.799306, 1.934955, 2.555415, 3.412702, 5.819950, -2.108622, -3.801892, 3.630069, 0.021923, -1.991505, -1.054845, 0.149873, -8.598531, -0.516323, 8.826857, 3.467027, -1.466808, ... (492 more)]
  Statistics: min=0.101169, max=1.581347, mean=1.030442, sum=527.586243
Correctness validation PASSED
linear_tiled_nv_cm2_half                           ACCU  B=4  I=128  O=128  Buf  fp16+bias L                                 [4x128]           5.920 μs         11.070 GFLOP/s   PASSED
----------------------------------------------------------------------
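
To make the "Correctness validation PASSED" line easier to sanity-check by hand, here is a minimal sketch that compares the first 20 printed elements of Output[0] against Output[0] (ref). The values are copied from the log above. The tolerance `ATOL` and the helper `max_abs_error` are assumptions for illustration only; the threshold the test harness actually applies is not shown in the log.

```python
# First 20 elements of Output[0] (fp16 shader result), copied from the log.
output = [
    -1.498047, -3.060547, -0.001218, -2.798828, 1.934570,
    2.558594, 3.414062, 5.820312, -2.109375, -3.806641,
    3.628906, 0.022491, -1.992188, -1.054688, 0.149658,
    -8.593750, -0.514648, 8.828125, 3.468750, -1.464844,
]

# First 20 elements of Output[0] (ref), copied from the log.
reference = [
    -1.499553, -3.061874, -0.002150, -2.799306, 1.934955,
    2.555415, 3.412702, 5.819950, -2.108622, -3.801892,
    3.630069, 0.021923, -1.991505, -1.054845, 0.149873,
    -8.598531, -0.516323, 8.826857, 3.467027, -1.466808,
]


def max_abs_error(actual, expected):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(a - e) for a, e in zip(actual, expected))


# Hypothetical tolerance; the harness's real threshold is not in the log.
ATOL = 1e-2

err = max_abs_error(output, reference)
print(f"max abs error over {len(output)} elements: {err:.6f}")
print("within tolerance" if err <= ATOL else "exceeds tolerance")
```

Only 20 of the 512 output elements are printed in the log, so this checks a sample rather than the full tensor.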

Differential Revision: D91945037
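
For readers interpreting the throughput column of the benchmark line, here is a short worked calculation from the reported shape (B=4, I=128, O=128) and time (5.920 μs). The reported 11.070 GFLOP/s is numerically consistent with counting one operation per multiply-accumulate (B·I·O). That convention is inferred from the numbers in the log, not confirmed from the benchmark source, and the helper `gflops` is a name introduced here for illustration.

```python
# Shape and timing copied from the benchmark result line in the log.
batch, in_features, out_features = 4, 128, 128
elapsed_us = 5.920
reported_gflops = 11.070


def gflops(op_count: int, elapsed_us: float) -> float:
    """Convert an operation count and a time in microseconds to GFLOP/s."""
    return op_count / (elapsed_us * 1e-6) / 1e9


# Number of multiply-accumulates in a [B, I] x [I, O] linear layer.
macs = batch * in_features * out_features

per_mac = gflops(macs, elapsed_us)        # one op per multiply-accumulate
per_flop = gflops(2 * macs, elapsed_us)   # multiply and add counted separately

print(f"1 op per MAC : {per_mac:.3f} GFLOP/s")
print(f"2 ops per MAC: {per_flop:.3f} GFLOP/s")
```

Under the more common convention of two floating-point operations per multiply-accumulate, the same run would be quoted at twice the figure, so it is worth stating the convention when comparing against other benchmarks.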


pytorch-bot · 2026-02-20T06:00:56Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17581

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit ab6b3da with merge base 0c87468:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner / linux-job (gh)
>>> Lint for backends/vulkan/test/custom_ops/test_fp_linear.cpp:
pull / test-openvino-linux / linux-job (gh)
RuntimeError: Command docker exec -t d13fd68cda94a2fec01343f52cbf9c7b5e2f2f12868f79736503314954570c79 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-02-20T06:02:07Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

HarryHu-art requested a review from SS-JIA as a code owner February 20, 2026 06:00

This was referenced Feb 20, 2026

[ET-VK] FP Linear benchmark + test op #17579

Open

[ET-VK] Utility file for NVidia extensions #17580

Open

[ET-VK] Quantized linear layer with NV cooperative matrix #17582

Open

meta-cla bot added the CLA Signed label Feb 20, 2026

meta-codesync bot added the fb-exported and meta-exported labels Feb 20, 2026

HarryHu-art wants to merge 1 commit into gh/HarryHu-art/4/base from gh/HarryHu-art/4/head
