[Issue]: Segmentation fault if libamd_comgr.*.so cannot be loaded and an RTCProgram is created #144

Open
stellaraccident opened this issue Feb 26, 2025 · 1 comment


@stellaraccident

I found this issue while working with a non-standard directory layout, attached a debugger, and found the root cause of the segfault.

This code in clr/hipamd/hiprtc/hiprtcInternal.cpp:

RTCProgram::RTCProgram(std::string name) : name_(name) {
  constexpr bool kComgrVersioned = true;
  std::call_once(amd::Comgr::initialized, amd::Comgr::LoadLib, kComgrVersioned);
  if (amd::Comgr::create_data_set(&exec_input_) != AMD_COMGR_STATUS_SUCCESS) {
    crashWithMessage("Failed to allocate internal hiprtc structure");
  }
}

does not check the return value of amd::Comgr::LoadLib(). As a result, when the library fails to load, the subsequent call to create_data_set crashes due to a nullptr dereference:

#0  0x0000000000000000 in ?? ()
#1  0x000073a32a0ffd59 in amd::Comgr::create_data_set (data_set=0x73a28c045198) at /therock/src/sources/clr/rocclr/device/comgrctx.hpp:205
#2  hiprtc::RTCProgram::RTCProgram (this=0x73a28c045120, name=...) at /therock/src/core/clr/hipamd/src/hiprtc/hiprtcInternal.cpp:62
#3  0x000073a32a103b7c in hiprtc::RTCCompileProgram::RTCCompileProgram (this=0x73a28c045120, name_=...)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/basic_string.h:1070

Instead of just relying on a bool return value from LoadLib, which is easy to miss (especially with call_once, which discards the callable's return value), maybe add an additional argument abort_on_failure and put a crashWithMessage inside the LoadLib function. These kinds of delay-load situations are basically unrecoverable, and it is better to crash with an error message than to segfault on a null dereference.
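
To make the suggestion concrete, here is a minimal, self-contained sketch of the pattern (C++17). It is not the actual clr code: the `sketch` namespace, `EnsureLoaded`, `CreateDataSet`, the `library_present` flag (which simulates whether the shared library could be opened), and the status enum are all hypothetical stand-ins. Only the `abort_on_failure` argument and the idea of crashing with a message inside `LoadLib` come from the proposal above.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <mutex>

namespace sketch {

// Hypothetical status codes standing in for the comgr status type.
enum Status { kSuccess = 0, kError = 1 };

// Hypothetical stand-in for crashWithMessage: report the problem, then abort.
[[noreturn]] inline void CrashWithMessage(const char* msg) {
  std::fprintf(stderr, "%s\n", msg);
  std::abort();
}

class Comgr {
 public:
  using CreateDataSetFn = Status (*)(int*);

  // Simulated loader. `library_present` is an assumption of this sketch and
  // replaces the real attempt to open the shared library.
  static bool LoadLib(bool library_present, bool abort_on_failure) {
    if (!library_present) {
      if (abort_on_failure) {
        CrashWithMessage("Failed to load the comgr shared library");
      }
      return false;
    }
    create_data_set_fn_ = &RealCreateDataSet;
    return true;
  }

  // Runs LoadLib at most once and keeps its result, because std::call_once
  // itself discards whatever the callable returns.
  static bool EnsureLoaded(bool library_present, bool abort_on_failure) {
    std::call_once(initialized_, [&] {
      is_ready_ = LoadLib(library_present, abort_on_failure);
    });
    return is_ready_;
  }

  // Guarded wrapper: returns an error instead of calling a null pointer.
  static Status CreateDataSet(int* data_set) {
    assert(data_set != nullptr);
    if (create_data_set_fn_ == nullptr) return kError;
    return create_data_set_fn_(data_set);
  }

 private:
  static Status RealCreateDataSet(int* data_set) {
    *data_set = 42;  // arbitrary placeholder handle value
    return kSuccess;
  }

  static inline std::once_flag initialized_;
  static inline bool is_ready_ = false;
  static inline CreateDataSetFn create_data_set_fn_ = nullptr;
};

}  // namespace sketch
```

With `abort_on_failure` set to true, a missing library ends the process with a readable message at load time. With it set to false, the caller gets a `false` from `EnsureLoaded` and an error status from `CreateDataSet`, rather than a jump to address 0.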

@ppanchad-amd

Hi @stellaraccident. An internal ticket has been created to investigate this issue. Thanks!
