[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile #146

GMNGeoffrey · 2025-03-11T18:19:07Z

Problem Description

The argument to some GPU functions is frequently only initialized in some threads. Clang created the maybe_undef annotation for precisely this use case. Calling cooperative_groups::thread_block_tile_base::shfl with a maybe-uninitialized argument results in a miscompile (at least vs the user intent. Not sure if the library or the compiler is at fault) where the compiler assumes that all threads have to take a branch where the variable is initialized.

Operating System

22.04.5 LTS (Jammy Jellyfish)

CPU

AMD Ryzen Threadripper PRO 7975WX 32-Cores

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.2.4

ROCm Component

clr

Steps to Reproduce

#include "hip/hip_runtime.h"
#include <hip/hip_cooperative_groups.h>

namespace cg = cooperative_groups;

#define HIP_CHECK(val) \
  { hip_check_((val), __FILE__, __LINE__); }

class HipException : public std::runtime_error {
 public:
  HipException(const std::string& what) : runtime_error(what) {}
};

inline void hip_check_(hipError_t val, const char* file, int line) {
  if (val != hipSuccess) {
    throw HipException(std::string(file) + ":" + std::to_string(line) + ": HIP error " +
                        std::to_string(val) + ": " + hipGetErrorString(val));
  }
}

template <int warp_size>
__global__ void get_kernel(size_t* d_block_count) {
  cg::thread_block_tile<warp_size> warp_tile =
      cg::tiled_partition<warp_size>(cg::this_thread_block());
  const size_t lane_idx = warp_tile.thread_rank();

  size_t warp_position;
  // size_t warp_position = 0;
  printf(
      "Before branch: lane_idx=%zu, recomputed_lane_idx=%u, d_block_count=%zu\n",
      lane_idx, warp_tile.thread_rank(), *d_block_count);
  if (lane_idx == 0) {
    warp_position = atomicAdd(d_block_count, 1);
    printf(
      "Incremented d_block_count: lane_idx=%zu, recomputed_lane_idx=%u, d_block_count=%zu, warp_position=%zu\n",
      lane_idx, warp_tile.thread_rank(), *d_block_count, warp_position);
  }
  warp_position = warp_tile.shfl(warp_position, 0);
  // warp_position used as index into an array or something
}

int main(int argc, char** argv) {
  hipStream_t stream;
  size_t h_block_count;
  size_t* d_block_count;

  HIP_CHECK(hipMalloc((void**)&d_block_count, sizeof(size_t)));
  HIP_CHECK(hipStreamCreate(&stream));

  HIP_CHECK(hipStreamSynchronize(stream));

  constexpr int warp_size = 64;
  constexpr int block_count = 1;
  constexpr int block_size = 64;
  get_kernel<warp_size><<<block_count, block_size, 0, stream>>>(d_block_count);

  HIP_CHECK(hipMemcpyAsync(&h_block_count, d_block_count, sizeof(size_t),
                              hipMemcpyDeviceToHost, stream));
  HIP_CHECK(hipStreamSynchronize(stream));
  printf("h_block_count=%zu\n", h_block_count);
  assert(h_block_count == block_count);

  HIP_CHECK(hipStreamDestroy(stream));
  HIP_CHECK(hipFree(d_block_count));
  return 0;
}

Execute with

#!/bin/bash

/opt/rocm/lib/llvm/bin/clang++ \
    -DHIP_ENABLE_WARP_SYNC_BUILTINS \
    -D__HIP_ROCclr__=1 \
    -g3 \
    -ggdb3 \
    --offload-arch=gfx90a \
    -fPIE \
    -O1 \
    -std=gnu++17 \
    -o miscompile_test.hip.o \
    -x hip \
    -c miscompile_test.hip


/opt/rocm/lib/llvm/bin/clang++ \
    -O0 \
    -g3 \
    -ggdb3 \
    --offload-arch=gfx90a \
    --hip-link \
    --rtlib=compiler-rt \
    -unwindlib=libgcc \
    miscompile_test.hip.o \
    -o miscompile_test \
    /opt/rocm/lib/libamdhip64.so.6.2.60204

./miscompile_test

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.10.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen Threadripper PRO 7975WX 32-Cores
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen Threadripper PRO 7975WX 32-Cores
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   5352
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            64
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    263382908(0xfb2e77c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    263382908(0xfb2e77c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    263382908(0xfb2e77c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx90a
  Uuid:                    GPU-29273e62ca00396c
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   58112
  Internal Node ID:        1
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx90a
  Uuid:                    GPU-beaf2359215679f8
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   49920
  Internal Node ID:        2
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 4
*******
  Name:                    gfx90a
  Uuid:                    GPU-0c409370be5107fb
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    3
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   768
  Internal Node ID:        3
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 5
*******
  Name:                    gfx90a
  Uuid:                    GPU-3644c2b4f8c805bb
  Marketing Name:          AMD Instinct MI210
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    4
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29711(0x740f)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   1536
  Internal Node ID:        4
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

With normal C++ semantics, if warp_position is uninitialized, the only way it gets defined before it gets shuffled is if it enters the if block, so the compiler concludes that it has to have entered the if block and all threads do so.

The _shfl operations have __attribute__((maybe_undef)):

clr/hipamd/include/hip/amd_detail/amd_warp_functions.h

Lines 130 to 136 in 3c863da

    
           __device__ 
        
           inline 
        
           int __shfl(MAYBE_UNDEF int var, int src_lane, int width = warpSize) { 
        
               int self = __lane_id(); 
        
               int index = (src_lane & (width - 1)) + (self & ~(width-1)); 
        
               return __builtin_amdgcn_ds_bpermute(index<<2, var); 
        
           }

The groups-level version don't though:

clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h

Lines 451 to 474 in 3c863da

    
             /** \brief Shuffle operation on group level. 
        
              * 
        
              *  \details Exchanging variables between threads without use of shared memory. 
        
              *           Shuffle operation is a direct copy of ``var`` from ``srcRank``  
        
              *           thread ID of group. 
        
              * 
        
              *  \tparam T The type can be a 32-bit integer or single-precision 
        
              *            floating point. 
        
              *  \param var [in] The source variable to copy. Only the srcRank thread ID of 
        
              *                  group is copied to other threads. 
        
              *  \param srcRank [in] The source thread ID of the group for copy. 
        
              */ 
        
             template <class T> 
        
             __CG_QUALIFIER__ T shfl(T var, int srcRank) const { 
        
               static_assert(is_valid_type<T>::value, "Neither an integer or float type."); 
        
               srcRank = srcRank % static_cast<int>(size()); 
        
               int lane = (size() == __AMDGCN_WAVEFRONT_SIZE) ? srcRank 
        
                        : (__AMDGCN_WAVEFRONT_SIZE == 64)     ? __fns64(coalesced_info.member_mask, 0, (srcRank + 1)) 
        
                                                     : __fns32(coalesced_info.member_mask, 0, (srcRank + 1)); 
        
               return __shfl(var, lane, __AMDGCN_WAVEFRONT_SIZE); 
        
             }

I'm not really clear on whether that's something that would be propagated to the calling function.

As an aside, the documentation also says that the value argument can only be a 32-bit int or float, which is a bit weird, especially since it's templated and has a static_assert that just checks if it's integral or float, and delegates to the warp functions that support way more types than that. That documentation was just added. Seems maybe wrong? Changing warp_position to int32_t doesn't fix anything though.

I tried throwing MAYBE_UNDEF (the HIP Macro around the clang attribute) on thread_block_tile_base::shfl (confirmed that this is the one getting called via printf and the debugger), but that doesn't fix the issue.

Calling __shfl directly like __shfl(warp_position, 0, 64) also fixes the issue. I'm not sure whether this is a bug in ROCm or [hip] clang.

The text was updated successfully, but these errors were encountered:

ppanchad-amd · 2025-03-12T15:21:37Z

Hi @GMNGeoffrey. Internal ticket has been created to investigate this issue. Thanks!

GMNGeoffrey mentioned this issue Mar 11, 2025

Undefined variable leads to potential clang miscompile despite maybe_undef nod-ai/dgl#14

Open

ppanchad-amd added the Under Investigation label Mar 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile #146

[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile #146

GMNGeoffrey commented Mar 11, 2025

ppanchad-amd commented Mar 12, 2025

[Issue]: maybe_undef on cooperative_groups::thread_block_tile_base::shfl fails causing miscompile #146

[Issue]: maybe_undef on cooperative_groups::thread_block_tile_base::shfl fails causing miscompile #146

Comments

GMNGeoffrey commented Mar 11, 2025

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

ppanchad-amd commented Mar 12, 2025

[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile #146

[Issue]: `maybe_undef` on `cooperative_groups::thread_block_tile_base::shfl` fails causing miscompile #146