Conversation

ChipKerchner (Contributor) commented Feb 10, 2026

Added the ability to accumulate in FP16 for GEMM; the sum is widened only once, at the end of the loops.

Testing LLVM FP16 LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         24948910

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         18968190

Accumulation differences are about 4X those of the previous (widening) version, but performance is up to 2.7X faster. Note: BananaPi shows only a 1.85X speedup.
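The trade-off described above can be sketched in pure Python, emulating FP16 rounding with `struct`'s half-precision format. This is an illustration of the accumulation strategy, not the PR's actual RISC-V vector code; all function names here are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    # Round a value to the nearest representable FP16 number
    # (struct's 'e' format is IEEE 754 half precision).
    return struct.unpack('<e', struct.pack('<e', x))[0]

def dot_widening(a, b):
    # Previous version: each FP16 product is widened and the
    # running sum is kept in full precision (one widen per step).
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc

def dot_fp16_accum(a, b):
    # This PR's approach: keep the running sum in FP16 and widen
    # only once, after the loop. Fewer conversions make it faster,
    # but every partial sum is rounded to FP16, so the accumulated
    # error is larger.
    acc = to_fp16(0.0)
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x) * to_fp16(y))
    return float(acc)  # single widening at the end

a = [0.1] * 512
b = [0.1] * 512
print(dot_widening(a, b), dot_fp16_accum(a, b))
```

Running both on the 512-element example shows the FP16-accumulated result drifting from the widened one, matching the "about 4X" accuracy difference observed in the PR.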


ChipKerchner commented Feb 10, 2026

Unfortunately, BF16 only has widening MADD instructions, so the same change cannot be made for BF16.
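Why a widening MADD blocks this optimization: the instruction takes narrow (BF16) multiplicands but its accumulator is always the wide (FP32) type, so the running sum cannot stay in BF16. A minimal sketch, emulating BF16 by truncating FP32 mantissa bits (hypothetical names; real hardware may round rather than truncate):

```python
import struct

def to_bf16(x: float) -> float:
    # BF16 is FP32 with the low 16 mantissa bits dropped
    # (simple truncation here, for illustration).
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def widening_madd(acc_f32, x, y):
    # A widening MADD: BF16 operands, but the destination/accumulator
    # is FP32 -- the narrow-accumulate trick used for FP16 has no
    # BF16 equivalent because no narrow-result MADD exists.
    return acc_f32 + to_bf16(x) * to_bf16(y)
```

So for BF16 the accumulator is forced wide on every step, which is exactly the per-iteration widening the FP16 path was able to avoid.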

@ChipKerchner

These changes are currently for VLEN = 256 only.

@ChipKerchner

It now works for VLEN = 128.

@ChipKerchner ChipKerchner marked this pull request as draft February 10, 2026 22:04
@ChipKerchner

Main loop now uses LMUL = 2

@ChipKerchner ChipKerchner marked this pull request as ready for review February 11, 2026 00:38

ChipKerchner commented Feb 11, 2026

Even faster!!!

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         13400067

@ChipKerchner

Convert inputs from BF16 to FP32 and use FP32 vector madds. 18% faster.
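The reason the convert-then-FP32-madd route is cheap: BF16 to FP32 conversion is exact and amounts to a 16-bit left shift of the bit pattern, after which ordinary FP32 multiply-adds apply. A small sketch of the idea (pure Python; the bit patterns and loop are illustrative, not from the PR):

```python
import struct

def bf16_bits_to_fp32(bits16: int) -> float:
    # BF16 -> FP32 is lossless: a BF16 value's bits are the top 16
    # bits of the corresponding FP32 value, so converting is just a
    # 16-bit left shift (cheap for vector hardware).
    return struct.unpack('<f', struct.pack('<I', bits16 << 16))[0]

# 0x3F80 is BF16 for 1.0; 0x4000 is BF16 for 2.0.
acc = 0.0
for xb, yb in [(0x3F80, 0x4000), (0x4000, 0x4000)]:
    # After conversion, plain FP32 multiply-adds replace the
    # widening BF16 MADD on the hot path.
    acc += bf16_bits_to_fp32(xb) * bf16_bits_to_fp32(yb)
print(acc)  # 1.0*2.0 + 2.0*2.0 = 6.0
```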

