  1. OpenShift Bugs
  2. OCPBUGS-61089

MCP upgrades with infra coredns static pod and rpm-ostree race condition can sometimes get stuck


    • Quality / Stability / Reliability
    • False
    • None
    • 1
    • Moderate
    • None
    • None
    • None
    • None
    • In Progress
    • Bug Fix
    • Added a retry in the MCO OS update operation to work around occasional network errors caused by coredns pod restarts. Previously, updates to coredns templates would restart the static pod, causing a race in which the subsequent OS update via rpm-ostree would fail the image pull with network errors and stall.
    • None
    • None
    • None
    • None

      When the Machine Config Daemon (MCD) applies the 4.16 manifests, the coredns static pod YAML is updated, which causes the pod to redeploy. rpm-ostree then fails to perform DNS lookups while the pod is restarting, and all upgrades halt indefinitely. (Requires an IPI install on a platform that uses the MCD-templated coredns static pods.)

      Occurs intermittently when upgrading to 4.16 on VMware infrastructure, on roughly 50% of nodes. The customer has experienced this issue on 10 clusters.

      E1011 08:09:28.338992 29910 writer.go:226] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02e1321a6afc7edcfe476869816af39e598762ea125caf16fa5c1d3a536aac4e : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02e1321a6afc7edcfe476869816af39e598762ea125caf16fa5c1d3a536aac4e: error: Creating importer: Failed to invoke skopeo proxy method OpenImage: remote error: (Mirrors also failed: [xxx.xxx.xxx.xxx:5009/openshift-release-dev/ocp-v4.0-art-dev@sha256:02e1321a6afc7edcfe476869816af39e598762ea125caf16fa5c1d3a536aac4e: pinging container registry xxx.xxx.xxx.xxx:5009: Get "https://xxx.xxx.xxx.xxx:5009/v2/": dial tcp: lookup xxx.xxx.xxx.xxx on [::1]:53: read udp [::1]:46362->[::1]:53: read: connection refused]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02e1321a6afc7edcfe476869816af39e598762ea125caf16fa5c1d3a536aac4e: pinging contai...
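
      The failure in the log above is a local resolver refusing connections on [::1]:53 while coredns restarts, so it should clear once the pod is back. A minimal Go sketch of a retry with backoff around the OS update follows. It is not the MCO implementation: the function names, the string-matching heuristic, and the attempt and delay values are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errSampleDNS mimics the resolver failure seen in the log (hypothetical demo value).
var errSampleDNS = errors.New(`dial tcp: lookup registry.example on [::1]:53: read udp [::1]:46362->[::1]:53: read: connection refused`)

// isTransientDNSError is a heuristic assumed for this sketch: it treats an error
// as retryable when the message looks like a refused DNS lookup.
func isTransientDNSError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "lookup ") && strings.Contains(msg, "connection refused")
}

// retryOSUpdate calls update up to maxAttempts times, doubling the delay between
// attempts. It stops early on success or on an error that does not look transient.
// It returns the number of attempts made and the final error.
func retryOSUpdate(maxAttempts int, delay time.Duration, update func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = update()
		if err == nil || !isTransientDNSError(err) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return maxAttempts, fmt.Errorf("OS update failed after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Stand-in for the real rebase call: fails twice with a DNS error, then succeeds.
	attempts, err := retryOSUpdate(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errSampleDNS
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

      Retrying only on errors that look like DNS failures keeps genuine problems, such as a bad image digest, from being masked by the retry loop.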
      

      Steps to reproduce:

      1. Pause the MCP.
      2. Apply an MC that triggers a coredns static pod restart together with an osImageURL update.
      3. Unpause the MCP.
      4. Expected: the MCP completes its update without getting stuck.
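
      The pause and unpause steps above can be sketched in Go as follows. The program only builds the JSON merge patch for the pool's spec.paused field and prints the matching oc patch command. The pool name "worker" is an assumption, and nothing is sent to a cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pausePatch returns a JSON merge patch that sets spec.paused on a MachineConfigPool.
func pausePatch(paused bool) (string, error) {
	body := map[string]interface{}{
		"spec": map[string]interface{}{"paused": paused},
	}
	b, err := json.Marshal(body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// First pause (step 1), then unpause (step 3); the MC is applied in between.
	for _, paused := range []bool{true, false} {
		patch, err := pausePatch(paused)
		if err != nil {
			panic(err)
		}
		// "worker" is a placeholder pool name chosen for this example.
		fmt.Printf("oc patch machineconfigpool/worker --type merge --patch '%s'\n", patch)
	}
}
```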

              team-mco Team MCO
              rh-ee-ptalgulk Prachiti Talgulkar
              None
              None
              Prachiti Talgulkar Prachiti Talgulkar
              None