-
Story
-
Resolution: Unresolved
-
Undefined
-
None
-
False
-
-
False
-
-
Goal:
Each accelerator has a different set of commands and env variables for debugging. Let's collect the various debugging tricks and put them into a team-wide document, so dev and QE have an easy way to find the information.
- how to access stage registry
- bootc switch, podman login + copy auth.json for ostree
- skopeo list-tags
- ilab system info
- torch API (torch.cuda)
- system management interface: nvidia-smi, rocm-smi, hl-smi
- system info: rocminfo
- device to ISA / compute level mapping
- env vars for extended logging and debug mode:
- https://docs.habana.ai/en/latest/PyTorch/Reference/Debugging_Guide/Debugging_with_Intel_Gaudi_Logs.html
- https://docs.habana.ai/en/latest/PyTorch/Reference/Runtime_Flags.html
- https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/debugging.html
- https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/logging.html
- introspect
- roc-obj-ls to figure out supported ISAs of a shared library