Add half::f16 and half::bf16 support #1551
swfsql wants to merge 5 commits into rust-ndarray:master

Conversation
```rust
(m127, 127, 127, 127) // ~128x slower than f32
(mix16x4, 32, 4, 32)
(mix32x2, 32, 2, 32)
// (mix10000, 128, 10000, 128) // too slow
```
You mention that f16 is slower in the issue and several times here. Is it faster on some operations? If not, why do you (and others) want to use it? Only to save space?
Yes, on my arch (x86_64) it is indeed quite slow. The half docs mention that aarch64 has hardware support for all half::f16 operations, so performance may be viable on that arch, but I haven't tested it. The crate also has specialized methods for storing/loading half::{f16, bf16} data to/from other types (u16, f32, f64), which could improve performance as well, but I didn't leverage those operations when adding the types to ndarray (I don't really know if/how they could be leveraged).
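For reference, the core scalar conversions look like this (a minimal sketch of mine using only APIs I know half provides; the slice-level load/store helpers are separate and not shown):

```rust
use half::{bf16, f16};

fn main() {
    // Round-trip through f32: the usual way in and out of f16.
    let x = f16::from_f32(1.5);
    assert_eq!(x.to_f32(), 1.5); // 1.5 is exactly representable in f16

    // Raw bit access, useful when moving f16 buffers around as u16.
    let bits: u16 = x.to_bits();
    assert_eq!(f16::from_bits(bits), x);

    // bf16 keeps f32's exponent range but truncates the mantissa.
    let y = bf16::from_f32(3.0);
    assert_eq!(y.to_f32(), 3.0);
}
```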
Even though (at least for me) it is a bit disappointing that it is slow, I find it useful for debugging fp16 models (for machine learning): some architectures behave poorly on fp16, and it is easier to debug them when everything runs on the CPU.
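To illustrate that workflow, here is a rough sketch of the kind of CPU-side usage this enables. It assumes ndarray is built with the half feature from this PR (and half's num-traits feature), and is my illustration rather than code from the PR:

```rust
use half::f16;
use ndarray::Array2;

fn main() {
    // Build small f16 matrices from f32 values (f16 has no literal syntax).
    let a = Array2::from_shape_fn((2, 2), |(i, j)| f16::from_f32((i + j) as f32));
    let b = Array2::<f16>::ones((2, 2));

    // The product runs entirely on the CPU in fp16, which is what makes
    // stepping through a misbehaving fp16 model practical.
    let c = a.dot(&b);

    // Convert back to f32 to compare against an f32 reference run.
    let c_f32 = c.mapv(f16::to_f32);
    println!("{c_f32}");
}
```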
For wasm targets, some builds may not enable CPU features, so the performance of f32 and half::f16 should be closer; in that case the memory savings could be meaningful, but I believe that is a niche target.
I don't know whether proper SIMD support is possible for fp16; it appears that some work towards this is active (for a primitive f16 type).
With that said, I still advocate for half::{f16, bf16} support in ndarray. It makes development and debugging smoother even if fp16 lacks proper SIMD support, given that the crux of the training happens on the GPUs. In the future, SIMD improvements could come either from the underlying half crate or from the addition of (or replacement by) a primitive f16 type.
I also understand that ndarray holds performance in high regard, and might therefore delay or decline inclusion of the fp16 types.
I just tested, and I'm seeing similarly poor performance on my aarch64 machine (an Apple M2 chip). Running cargo +nightly bench -F half mat_mul, m127 is ~346x slower for f16 than f32, and ~110x slower for bf16 than f32. My guess after some debugging is that the gap comes from the f32 and f64 implementations dispatching to matrixmultiply, since I can clearly see that I am hitting the f16 assembly code on my machine.
Given how small this change is relative to the overall library, I'm still in favor of including this code, but we may want to file an issue with matrixmultiply to see if they are willing to support f16.
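To make the dispatch point concrete, here is an illustrative sketch (simplified names, not the actual ndarray source) of why f16 and bf16 miss the fast path:

```rust
use std::any::TypeId;

// Simplified illustration of the element-type dispatch ndarray performs
// before matrix multiplication; the names and structure here are mine.
fn uses_matrixmultiply<A: 'static>() -> bool {
    // f32 and f64 get routed to matrixmultiply's sgemm/dgemm kernels.
    TypeId::of::<A>() == TypeId::of::<f32>() || TypeId::of::<A>() == TypeId::of::<f64>()
}

fn main() {
    assert!(uses_matrixmultiply::<f32>());
    // Every other element type, including half::f16 and half::bf16,
    // falls back to a generic loop, hence the large slowdown.
    assert!(!uses_matrixmultiply::<half::f16>());
}
```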
Thanks for testing and debugging this.
On aarch64, could you please try the bench again, adding the nightly feature to the half dependency while still running on nightly? I believe changing the half features to features = ["num-traits", "nightly"] should do it. I'm curious whether this feature brings an improvement on that arch.
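For clarity, the Cargo.toml change I mean would look roughly like this (the version number and the optional flag here are my assumptions, not copied from the repo):

```toml
[dependencies]
half = { version = "2", optional = true, features = ["num-traits", "nightly"] }
```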
akern40 left a comment
Actually, before we merge, would you mind adding a note about the performance implications to the README?
I've just added a note. Please let me know if there are other improvements for this PR; I'm happy to make changes if I can.