  1. Hybrid Cloud Infrastructure Documentation
  2. HCIDOCS-316

Add information about Metal3 database to bare-metal IPI troubleshooting


    • Bug
    • Resolution: Done
    • Major
    • Metal
    • HCIDOCS 2024#8, HCIDOCS 2024#9
    • 2
    • Low

      We have noticed a couple of things that our customers try to do without fully understanding the implications:

      1. Removing finalizers from BareMetalHost resources that take a long time to get deleted.
      2. Deleting the Metal3 pod as a universal troubleshooting step.

      The latter is questionable; the former is outright harmful. I wonder if we can expand the customer documentation with a warning. I believe it belongs in installing/installing_bare_metal_ipi/ipi-install-troubleshooting.html, but I'm open to better ideas.

      The note could go along these lines (I'm not a good writer, apologies):

      Problem: Deleting BareMetalHost resources takes a long time
      
      Explanation: When a BareMetalHost resource is deleted (for example, as part of a MachineSet scale-down), the bare-metal machine is first deprovisioned. Deprovisioning powers the machine off, boots a service ramdisk on it, and removes partitioning metadata from all disks, a process known as "cleaning". Finally, the machine is powered off again. If something goes wrong during this process, for example because the bare-metal machine is misbehaving and cannot boot, the deletion can remain stuck for a very long time.
      
      Solution: If the issue is recoverable, wait for the process to finish. If cleaning cannot possibly succeed, disable it by setting the automatedCleaningMode field of the BareMetalHost to "disabled".
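
The Solution step could be sketched as follows. This is only an illustration, not a tested procedure: it assumes the `oc` CLI is used, that the hosts live in the `openshift-machine-api` namespace, and that a JSON merge patch is acceptable. The helper name `build_disable_cleaning_command` and the host name `worker-2` are hypothetical. The code only builds and prints the command; it does not run it.

```python
import json
import shlex


def build_disable_cleaning_command(host_name, namespace="openshift-machine-api"):
    """Return the argv list for an `oc patch` call that disables automated cleaning.

    The namespace default is an assumption; adjust it to where the
    BareMetalHost resources actually live.
    """
    # Merge patch that touches only spec.automatedCleaningMode.
    patch = {"spec": {"automatedCleaningMode": "disabled"}}
    return [
        "oc", "patch", "baremetalhost", host_name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # "worker-2" is a hypothetical host name used only for illustration.
    print(shlex.join(build_disable_cleaning_command("worker-2")))
```

Returning an argument list rather than a shell string keeps the JSON patch free of quoting problems if the command is later passed to `subprocess.run`.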
      
      Warning: Never remove the finalizer to force the deletion of a BareMetalHost! The provisioning back end has its own database, which keeps a host record even if you force-delete the BareMetalHost resource. Any actions already running on the host continue regardless of the deletion, and you may hit unexpected issues when you try to re-add the host later.
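
To make the situation in the warning easier to recognize, here is a small sketch that inspects a BareMetalHost object (for example, the parsed output of `oc get baremetalhost <name> -o json`) and reports whether deletion has been requested but is still held back by a finalizer. It relies only on generic Kubernetes metadata (`deletionTimestamp`, `finalizers`). Reading `status.provisioning.state` is an assumption about the resource layout, and the function name and sample values are hypothetical.

```python
def describe_pending_deletion(bmh):
    """Return a short message if the host is waiting on a finalizer, else None."""
    metadata = bmh.get("metadata", {})
    # Kubernetes sets deletionTimestamp once deletion has been requested.
    marked = metadata.get("deletionTimestamp") is not None
    # A non-empty finalizer list blocks removal until a controller clears it.
    blocked = bool(metadata.get("finalizers"))
    if not (marked and blocked):
        return None
    # Assumed field path; falls back to "unknown" when it is absent.
    state = bmh.get("status", {}).get("provisioning", {}).get("state", "unknown")
    name = metadata.get("name", "<unnamed>")
    return (
        f"{name}: deletion in progress (provisioning state: {state}). "
        "Wait or disable cleaning; do not remove the finalizer."
    )


if __name__ == "__main__":
    # Hypothetical sample object; the finalizer value is a placeholder.
    sample = {
        "metadata": {
            "name": "worker-2",
            "deletionTimestamp": "2024-01-01T00:00:00Z",
            "finalizers": ["example.invalid/placeholder"],
        },
        "status": {"provisioning": {"state": "deprovisioning"}},
    }
    print(describe_pending_deletion(sample))
```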
      

      I'm not sure how and where to insert information about Metal3 pod deletion. Essentially, deleting the pod wipes the internal database, which is then rebuilt from the information in the BareMetalHost resources once the pod is automatically re-created. This is a valid last-resort measure, but customers need to understand its implications. Most importantly, any running provisioning processes are aborted and restarted from scratch. Maybe we should just recommend NOT doing that... Ideas welcome.

              rhn-support-jowilkin John Wilkins
              rhn-engineering-dtantsur Dmitry Tantsur