Shuffle scalable vector in CodeGen_ARM by stevesuzuki-arm · Pull Request #8898 · halide/Halide

stevesuzuki-arm · 2025-12-11T19:27:47Z

By design, LLVM's shufflevector does not support general shuffles of scalable vectors.
So, we use the llvm.vector.xx intrinsics where possible.
However, those are not enough to cover the wide range of shuffles used in Halide.
To handle arbitrary index patterns, we decompose a shuffle operation
into a sequence of native shuffles, which are lowered to the
Arm SVE2 intrinsics TBL or TBL2.
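
The decomposition idea can be illustrated with a small, self-contained sketch. This is not the code in DecomposeVectorShuffle; the names `NativeStep`, `SlicePlan`, and `decompose_shuffle` are hypothetical, and the pairing strategy (group the source slices referenced by each destination slice two at a time) is an assumption made only for illustration. A step with one source slice stands in for a TBL-like lookup, and a step with two stands in for a TBL2-like lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// One native shuffle (hypothetical model). It reads slice_a and, optionally,
// slice_b (-1 means "no second table", i.e. a single-table lookup).
// lane_map[i] indexes into the concatenation [slice_a, slice_b];
// -1 means this step does not write destination lane i.
struct NativeStep {
    int slice_a;
    int slice_b;
    std::vector<int> lane_map;
};

// All steps needed to build one destination slice.
using SlicePlan = std::vector<NativeStep>;

// Split an arbitrary shuffle (indices into one long source vector, -1 for
// "don't care") into per-destination-slice plans of native-width steps.
std::vector<SlicePlan> decompose_shuffle(const std::vector<int> &indices, int width) {
    assert(width > 0 && indices.size() % width == 0);
    std::vector<SlicePlan> plans;
    for (std::size_t base = 0; base < indices.size(); base += width) {
        // Collect the distinct source slices this destination slice reads.
        std::set<int> used;
        for (int i = 0; i < width; i++) {
            int idx = indices[base + i];
            if (idx >= 0) used.insert(idx / width);
        }
        std::vector<int> slices(used.begin(), used.end());

        // Consume the source slices two at a time.
        SlicePlan plan;
        for (std::size_t s = 0; s < slices.size(); s += 2) {
            NativeStep step;
            step.slice_a = slices[s];
            step.slice_b = (s + 1 < slices.size()) ? slices[s + 1] : -1;
            step.lane_map.assign(width, -1);
            for (int i = 0; i < width; i++) {
                int idx = indices[base + i];
                if (idx < 0) continue;
                int src = idx / width;
                if (src == step.slice_a) {
                    step.lane_map[i] = idx % width;
                } else if (src == step.slice_b) {
                    step.lane_map[i] = width + idx % width;
                }
            }
            plan.push_back(step);
        }
        // Note: the plan is empty when every lane of the slice is -1.
        plans.push_back(plan);
    }
    return plans;
}
```

In this model, a destination slice whose lanes are all -1 yields an empty plan, so whatever consumes the plan has to be prepared for that case.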

Another approach could be to perform the shuffle on fixed-size vectors
by adding conversions between scalable and fixed vectors.
However, that seems to be possible only via a store to and load from memory,
which would presumably perform poorly.

This change also includes:

- Peephole a particular predicate pattern to emit the WHILELT instruction
- Shuffle 1-bit scalable vectors as 8-bit vectors with type casts
- Peephole concat_vectors used for padding to align up a vector
- Fix a redundant broadcast in CodeGen_LLVM
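
As background for the WHILELT item, here is a minimal model of the predicate that instruction produces: lane i is active while start + i is less than the limit. The PR text does not spell out which predicate pattern is matched; the sketch assumes it is a unit-stride ramp compared against a broadcast limit, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Model of a WHILELT-style predicate: lane i is true iff start + i < limit.
std::vector<bool> whilelt(int64_t start, int64_t limit, int lanes) {
    assert(lanes >= 0);
    std::vector<bool> pred(lanes);
    for (int i = 0; i < lanes; i++) {
        pred[i] = (start + i) < limit;
    }
    return pred;
}

// Lane-wise evaluation of the assumed source pattern:
// ramp(base, 1, lanes) < broadcast(limit).
std::vector<bool> ramp_lt_broadcast(int64_t base, int64_t limit, int lanes) {
    assert(lanes >= 0);
    std::vector<int64_t> ramp(lanes);
    std::vector<int64_t> bcast(lanes, limit);
    for (int i = 0; i < lanes; i++) {
        ramp[i] = base + i;
    }
    std::vector<bool> pred(lanes);
    for (int i = 0; i < lanes; i++) {
        pred[i] = ramp[i] < bcast[i];
    }
    return pred;
}
```

Under that assumption the two functions agree lane for lane, which is why the comparison can be replaced by a single predicate-generating instruction.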

Depends on:

Support strided Load/Store in SVE2 #8888

stevesuzuki-arm · 2025-12-11T20:10:25Z

The CI test failure below is a known issue which should be fixed by #8888.

st2w_int32_x8                   (arm-64-linux-no_neon-sve2-vector_bits_256)
StartAssertion failed: (!isScalable() || isZero()) && "Request for a fixed element count on a scalable object", file C:\build_bot\worker\llvm-main-x86-32-windows\llvm-project\llvm\include\llvm/Support/TypeSize.h, line 202

I will rebase once #8888 is merged.

In theory, these helpers are common to LLVM and not ARM-specific, but for now we keep them ARM-only to avoid any effect on other targets.

The workaround of checking wide_enough in get_vector_type() caused FixedVector and ScalableVector types to be mixed when generating an intrinsic instruction in the SVE2 codegen. With this change, we select a scalable vector in most cases. Note that the workaround for the vscale > 1 case will be addressed in a separate commit.

Modified the codegen of vector broadcasts in SVE2 to emit the ARM TBL intrinsic instead of llvm.vector.insert. Fixes the performance test failure of nested_vectorization_gemm.

alexreinking · 2026-02-15T04:29:50Z

Rebased on main to integrate #8888

alexreinking · 2026-02-15T05:48:06Z

@stevesuzuki-arm -- looks like there are some real LLVM integration errors.

- Fix to cover the vector_bits > 128 case appropriately
- Add a test target with vector_bits = 512
- W is increased to 512 because the bounds given for op_st4b_int8_x512_0 did not cover the required region
- Refine the target compatibility check for can_run_code

stevesuzuki-arm · 2026-02-16T19:47:27Z

I think both #8950 and #8898 are necessary to fix the LLVM crash in older LLVM versions. Let's see the CI results.

alexreinking · 2026-02-16T21:30:47Z

> I think both #8950 and #8898 are necessary to fix llvm crash in old llvm version. Let's see CI results.

Can one be merged first, or should we join them together as one PR? After all the changes I made to CI, we're trying to adopt a policy of only merging green PRs.

stevesuzuki-arm · 2026-02-16T21:49:57Z

This PR has already incorporated #8950, so if the ongoing CI run passes, we should close #8950 without merging and keep this as the combined one.

alexreinking

See in-line comments for my smaller concerns.

I'm not sure DecomposeVectorShuffle is the right design. I think it's needlessly coupling computing the shuffle plan (which doesn't depend on the codegen backend) with generating the code for the plan (which does).

alexreinking · 2026-02-17T03:50:45Z

If it's okay with you, @stevesuzuki-arm, I can perform the refactoring I requested myself. I'll just push to this branch.

stevesuzuki-arm · 2026-02-17T10:13:30Z

> If it's okay with you, @stevesuzuki-arm, I can perform the refactoring I requested myself. I'll just push to this branch.

Thank you for reviewing this large set of updates. I really appreciate it.
It’s totally fine with me if you’d like to go ahead and perform the refactoring and push to this branch.

alexreinking · 2026-02-17T12:15:53Z

Okay, I made all the changes I requested to DecomposeVectorShuffle... now I will look at CodeGen_ARM 🙂

# Conflicts: # src/CodeGen_ARM.cpp

alexreinking

I'd like a little bit more testing, I think. Can you point to a test that covers the new codegen without needing to run SVE2?

alexreinking · 2026-02-17T14:06:38Z

test/correctness/simd_op_check_sve2.cpp

 class SimdOpCheckArmSve : public SimdOpCheckTest {
 public:
-    SimdOpCheckArmSve(Target t, int w = 384, int h = 32)
+    SimdOpCheckArmSve(Target t, int w = 512, int h = 16)


Does anything in these tests check that the expected tbl, tbl2, whilelt, etc. instructions are generated?

I'm afraid not. The changes around vector shuffle were driven by fixing failures in other Halide unit tests where we need to handle more complex vectorization.

A test for tbl is added to simd_op_check_sve2 by another downstream commit for dynamic_shuffle, although it has not been pushed yet.

Can we get even one simple case that exercises the new codegen paths in?

Not in simd_op_check_sve2, but in correctness_interleave [1] I've just found a bug that was unveiled by the refactoring.
If the caller of shuffle_vectors() sets indices to -1, steps_for_dst_slice in shuffle_plan may be empty, which ends up calling concat_vectors with a null element.
Possible solutions:
a) Initialize dst_slice as an undef value
and/or
b) Avoid an empty steps_for_dst_slice in the decomposition, i.e. it must have at least one step even if that step produces only undef/poison

[1] sz = 27 case in

Halide/test/correctness/interleave.cpp

Line 355 in 1877f41

for (int sz : {8, 27, 256}) {

This is a compilation crash, so I suppose we don't need an SVE2 host to reproduce it.

As for the simple case:
whilelt can be tested by the st1b_int8_x8 case in simd_op_check_sve2.
tbl can be tested by correctness_vector_shuffle.

Will you please add those test cases and the fix for the interleave test?

I've pushed the fix for the interleave test.
Just to clarify, are you suggesting to add dedicated test cases for tbl and whilelt in simd_op_check_sve2?

Yes, I think the last thing missing here is something that verifies that tbl and whilelt are reachable from the front-end.

I have added tbl and whilelt test cases

test/correctness/decompose_vector_shuffle.cpp

If the caller of shuffle_vectors() sets indices to -1, steps_for_dst_slice in shuffle_plan may be empty, which ended up calling concat_vectors with a null element. The fix is to create an undef vector for that particular case. The original issue was found in correctness_interleave (transposition with sz=27).
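
The shape of that fix can be sketched in isolation. The model below is hypothetical and is not the code in CodeGen_ARM: `Step`, `build_dst_slice`, and the `kUndef` sentinel are names invented for this example, and lanes are plain integers rather than LLVM values. The point it illustrates is that the destination slice starts out as "undef", so an empty list of steps still yields a well-formed value.

```cpp
#include <cassert>
#include <vector>

// Stand-in for an undef/poison lane in this toy model (assumption).
constexpr int kUndef = -1;

// One native shuffle step (hypothetical): a table of source lanes plus, for
// each destination lane, an index into that table or -1 for "leave as is".
struct Step {
    std::vector<int> table;
    std::vector<int> lane_map;
};

// Build one destination slice. Starting from an all-undef slice means that an
// empty `steps` list (every requested index was -1) still returns a valid
// slice instead of a missing one.
std::vector<int> build_dst_slice(const std::vector<Step> &steps, int width) {
    std::vector<int> dst(width, kUndef);
    for (const Step &step : steps) {
        assert(static_cast<int>(step.lane_map.size()) == width);
        for (int i = 0; i < width; i++) {
            int src = step.lane_map[i];
            if (src >= 0) {
                dst[i] = step.table[src];
            }
        }
    }
    return dst;
}
```

With no steps, `build_dst_slice({}, 4)` returns four undef lanes, which is the situation the sz=27 interleave case ran into.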

stevesuzuki-arm · 2026-03-11T17:59:39Z

@alexreinking Could you tell me if there is anything else I should do to move this forward?

alexreinking · 2026-03-11T18:00:57Z

Let's rebase this on main. We now have SVE2 testing through GHA and I'm inclined to accept if green.

alexreinking · 2026-03-13T14:26:51Z

@stevesuzuki-arm -- Please check to see if the LLVM 21 failures can be easily solved. Otherwise, skip the failing tests on LLVM <22 + SVE2

stevesuzuki-arm · 2026-03-13T14:31:14Z

> @stevesuzuki-arm -- Please check to see if the LLVM 21 failures can be easily solved. Otherwise, skip the failing tests on LLVM <22 + SVE2

CI failures with llvm-21 are as follows:

correctness_interleave
LLVM ERROR: Don't know how to widen the operands for INSERT_SUBVECTOR
llvm/llvm-project#160134
llvm/llvm-project#169300

correctness_stmt_to_html
tutorial_lesson_05_scheduling_1
Request for a fixed element count on a scalable object
llvm/llvm-project#160127

correctness_predicated_store_load_single_lane
LLVM ERROR: Unable to widen vector store
llvm/llvm-project#54424

Making a workaround in codegen would be tricky, so I'd prefer "we should just say we only support sve2 from llvm 22 up".

alexreinking · 2026-03-13T14:36:29Z

> Making a workaround in codegen would be tricky, so I'd prefer "we should just say we only support sve2 from llvm 22 up".

Works for me. Skip the tests for now. In a follow-up PR, we can add a user_error for trying to compile SVE2 code with LLVM 21. That will also require not detecting SVE2 as part of host when linked against LLVM 21, so it's a larger change we should defer.

stevesuzuki-arm · 2026-03-15T18:01:39Z

@alexreinking Can we merge this now?

alexreinking · 2026-03-15T18:13:44Z

> Works for me. Skip the tests for now. In a follow-up PR ...

> @alexreinking Can we merge this now?

I was waiting for you to add [SKIP] prints to the failing tests.

The following tests FAILED:
	219 - correctness_interleave (Subprocess aborted)       correctness
	283 - correctness_predicated_store_load_single_lane (Subprocess aborted) correctness
	336 - correctness_stmt_to_html (Subprocess aborted)     correctness
	713 - tutorial_lesson_05_scheduling_1 (Subprocess aborted) multithreaded tutorial

Something like this:

    if (Internal::get_llvm_version() < 220 &&
        get_jit_target_from_environment().has_feature(Target::SVE2)) {
        printf("[SKIP] LLVM %d has known SVE backend bugs for this test.\n",
               Internal::get_llvm_version());
        return 0;
    }

stevesuzuki-arm · 2026-03-15T18:19:19Z

Thank you for clarifying. I thought the job would be skipped in case of LLVM21. OK, I will do what you suggested.

The SVE2 backend hits some LLVM issues that have been fixed in LLVM 22. For now, we skip the tests that fail due to those issues when the LLVM version is < 22.

alexreinking · 2026-03-15T21:59:45Z

Just merged main into this branch for the new pre-commit tooling. Will merge when green. Thank you so much for your patience and valuable contributions, @stevesuzuki-arm 🎉

mcourteaux · 2026-03-16T17:32:13Z

We might want to hold off on merging this, as this is the type of PR that can bring us a lot of headaches. I think it'd be wise to first fully implement our goals in #9044, and then integrate those strategies here.

alexreinking requested a review from halidebuildbots December 11, 2025 19:35

stevesuzuki-arm force-pushed the pr-shuffle_sve2 branch from 4a40326 to 9c9e621 Compare December 15, 2025 11:30

stevesuzuki-arm added 6 commits February 14, 2026 23:29

Add helpers for shuffle operations of scalable vector

9772743

Move helpers for shuffle scalable vectors to CodeGen_ARM

5793e89

Add DecomposeVectorShuffle to Makefile

85ef5b5

Improve performance of vector broadcast in SVE2

84ec0ee

alexreinking force-pushed the pr-shuffle_sve2 branch from 9c9e621 to 84ec0ee Compare February 15, 2026 04:29

alexreinking requested changes Feb 17, 2026

alexreinking reviewed Feb 17, 2026

src/CMakeLists.txt

alexreinking reviewed Feb 17, 2026

test/correctness/simd_op_check_sve2.cpp (outdated)

Simplify DecomposeVectorShuffle implementation; move test to tests/

e3ae7ab

stevesuzuki-arm mentioned this pull request Feb 17, 2026

Modify test cases of total bit width in simd_op_check_sve2 #8950

Closed

alexreinking added 3 commits February 17, 2026 06:52

Remove extra header in decompose_vector_shuffle.cpp

190e032

Add an assertion that highest_lane is a real lane.

f671968

Assert that dst_slice has a value

9d536a8

alexreinking added 2 commits February 17, 2026 07:25

Don't assert on paths that don't require alignment

ec222f7

Assert arguments to shuffle_scalable_vectors_general are valid

6cfb39e

alexreinking added 4 commits February 17, 2026 07:29

Fix typo in comment in CodeGen_ARM::visit(Shuffle)

c58006e

Eliminate need for DecomposeVectorShuffle wrappers

f35a8a8

Drop old function prototypes

562ac36

Remove unused using std::optional

26809ed

alexreinking requested changes Feb 17, 2026

stevesuzuki-arm commented Feb 17, 2026

test/correctness/decompose_vector_shuffle.cpp (outdated)

mcourteaux mentioned this pull request Mar 11, 2026

Fix Vulkan interleave two vectors bug + Vector Legalization lowering pass #8629

Open

stevesuzuki-arm added 3 commits March 11, 2026 22:13

Merge branch 'main' into pr-shuffle_sve2

bc284e0

Add "tbl" and "whilelt" in simd_op_check_sve2

acb6f5b

Merge branch 'main' into pr-shuffle_sve2

be0b4a6

alexreinking approved these changes Mar 13, 2026

stevesuzuki-arm and others added 2 commits March 15, 2026 20:38

Skip failing tests on SVE2 target with LLVM21

2d43f3d

Merge branch 'main' into pr-shuffle_sve2

b4572b7

Skip performance_boundary_conditions on SVE2 with LLVM 21

80087d5
