Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-56994

Machine Config Daemon on HCP never retries after failure

XMLWordPrintable

    • Quality / Stability / Reliability
    • False
    • Hide

      None

      Show
      None
    • 1
    • None
    • None
    • None
    • None
    • None
    • MCO Sprint 277
    • 1
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      OCP 4.18 HUB cluster fails with error where the component running rpm-ostree on HCP worker nodes do not retry when encountering an issue, causing the cluster to get stuck during the upgrade process from a single failed image pull. The customer when using a pull-through registry cache, the first image pull for the first node timeout, but never retries. resulting in the entire node pool become stuck.

      In addition due to unnecessarily  transient nature of the MCD pods in hypershift it is impossible to capture pod logs for diagnosis making this issue extremely difficult to diagnose with very little gained by this feature.

      Upon troubleshooting, it was noted that the `machineconfig` daemon `pod` is not always running like in a regular cluster. Instead, when the desired `mcd` config is different than current config, we deploy the `pod` onto the worker `node` to do the upgrade in place.

      Please refer to this.

      The inspect of the upgrade namespace from the hosted cluster is available here.

      Please also refer to MCO-358, as it seems to be related to this request.

              djoshy David Joshy
              rhn-support-rdeolive Rafael de Oliveira Rosa
              None
              None
              None
              None
              Tim Dawson
              Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

                Created:
                Updated:
                Resolved: