Over the past several weeks I have been porting a codebase from the Nvidia CUDA platform to AMD HIP. I ran into several critical issues: some were solved, some I attributed to compiler bugs, and some remain unfathomable to me.
There are 3 important things I have learned so far from the painstaking debugging process.
- An unsigned integer with n bits only allows a bitwise left shift (`<<`) by 0 to n-1 positions. Shifting by n or more is undefined behavior. In that case the Nvidia platform shifts in 0s, whereas AMD shifts in 1s!!!
- Currently there is a serious compiler bug: the wavefront vote function `__any(pred)`, which is supposed to work like `__any_sync(__activemask(), pred)` in CUDA, yields incorrect results in divergent threads!!!
- This is very easy to miss: the parameter of the wavefront vote functions (`__all(pred)`, etc.) is a 32-bit integer on both the Nvidia and AMD platforms. If a 64-bit integer is passed instead, its higher bits are truncated!!! The solution is to explicitly cast the 64-bit integer to bool, which is then implicitly converted to int.
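
The first and third points can be reproduced on the host without a GPU. Below is a minimal sketch in plain C++, not actual device code: `vote_stub` is a hypothetical stand-in for any function that takes a 32-bit `int` predicate (as `__all` and `__any` do), and `safe_shl`, `pred_truncated`, and `pred_via_bool` are names made up for illustration. It assumes a 32-bit `int`, which holds on both platforms.

```cpp
#include <cstdint>

// Hypothetical stand-in for a vote function with a 32-bit int parameter.
// It only reports whether the value it actually received is nonzero.
inline bool vote_stub(int pred) { return pred != 0; }

// Narrowing a 64-bit value to int keeps only the low 32 bits,
// so a flag stored in the upper half is silently lost.
inline bool pred_truncated(std::uint64_t flags) {
    return vote_stub(static_cast<int>(flags));
}

// Casting to bool first collapses "any bit set" to true,
// which then converts to the int value 1.
inline bool pred_via_bool(std::uint64_t flags) {
    return vote_stub(static_cast<bool>(flags));
}

// Hypothetical guard for the shift problem: a count of 32 or more
// returns 0 explicitly instead of relying on undefined behavior.
inline std::uint32_t safe_shl(std::uint32_t value, unsigned count) {
    return count < 32u ? (value << count) : 0u;
}
```

With `flags = 1ull << 40`, `pred_truncated(flags)` is false while `pred_via_bool(flags)` is true. Returning 0 from `safe_shl` for an excessive count is a design choice for this sketch; pick whatever result your algorithm expects, as long as it is explicit rather than left to the platform.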