OCPBUGS-18964

The MCO can rarely degrade during installs if the Operator pod loses leader election before finishing the first sync

    • Quality / Stability / Reliability
    • MCO Sprint 242

      New description of problem:

      When the MCO installs, there is a small chance (<1%) that the first MCO pod loses leader election partway through its init loop and that the subsequent pod never performs an init. This can cause the MCO to degrade with a "rendered-master not found" error, because the initially generated MachineConfig never gets regenerated. The diff that triggers this can be any of the following:

      1. /etc/containers/registries.conf reverting to empty before being regenerated
      2. /etc/kubernetes/kubelet.conf and /var/lib/kubelet/config.json reverting to empty before being regenerated
      3. /etc/mco/internal-registry-pull-secret.json being generated on the first run

      Technically, 1 and 2 are "maskable" errors if 3 isn't an issue, so the fix options are:

      1. Have the MCO pod re-init if the previous one was unsuccessful (this may require changing the init mode detection mechanism)
      2. Move the registry pull secret to the certificate path and remove it from the MachineConfigs
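
Fix option 1 could be sketched roughly as follows. This is only an illustration of the idea, not the MCO's actual code: `InitState`, its fields, `should_run_init` and `run_operator` are all hypothetical names, and the sketch assumes the operator can persist a marker once the first sync has completed, so that a later pod decides on init mode from that marker rather than from the mere fact that an earlier pod ran.

```python
from dataclasses import dataclass


@dataclass
class InitState:
    """Hypothetical record of what earlier operator pods left behind."""
    started: bool = False    # some pod began the init loop
    completed: bool = False  # some pod finished the first full sync


def should_run_init(state: InitState) -> bool:
    """Run init unless a previous pod is known to have completed it."""
    return not state.completed


def run_operator(state: InitState, lose_leadership_midway: bool = False) -> str:
    """Simulate one operator pod's lifetime against the shared state."""
    if not should_run_init(state):
        return "normal sync"
    state.started = True
    if lose_leadership_midway:
        # The pod exits before the first sync finishes; 'completed' stays False.
        return "init interrupted"
    state.completed = True
    return "init completed"


state = InitState()
print(run_operator(state, lose_leadership_midway=True))  # first pod loses the lease
print(run_operator(state))                               # second pod re-runs init
print(run_operator(state))                               # later pods sync normally
```

The point of the design is that "init was started" and "init was completed" are tracked separately, so an interrupted first pod does not prevent the next leader from regenerating the initial MachineConfigs.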

       

      Original description of problem:

      While installing many SNOs via ZTP using ACM, a SNO failed to complete install because the MCO was degraded during the install process.
      
      # oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         20h     Error while reconciling 4.14.0-rc.0: the cluster operator machine-config is degraded
      
      # oc get co
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.0-rc.0   True        False         False      173m    
      baremetal                                  4.14.0-rc.0   True        False         False      20h     
      cloud-controller-manager                   4.14.0-rc.0   True        False         False      20h     
      cloud-credential                           4.14.0-rc.0   True        False         False      20h     
      cluster-autoscaler                         4.14.0-rc.0   True        False         False      20h     
      config-operator                            4.14.0-rc.0   True        False         False      20h     
      console                                    4.14.0-rc.0   True        False         False      20h     
      control-plane-machine-set                  4.14.0-rc.0   True        False         False      20h     
      csi-snapshot-controller                    4.14.0-rc.0   True        False         False      20h     
      dns                                        4.14.0-rc.0   True        False         False      20h     
      etcd                                       4.14.0-rc.0   True        False         False      20h     
      image-registry                             4.14.0-rc.0   True        False         False      20h     
      ingress                                    4.14.0-rc.0   True        False         False      20h     
      insights                                   4.14.0-rc.0   True        False         False      20h     
      kube-apiserver                             4.14.0-rc.0   True        False         False      20h     
      kube-controller-manager                    4.14.0-rc.0   True        False         False      20h     
      kube-scheduler                             4.14.0-rc.0   True        False         False      20h     
      kube-storage-version-migrator              4.14.0-rc.0   True        False         False      20h     
      machine-api                                4.14.0-rc.0   True        False         False      20h     
      machine-approver                           4.14.0-rc.0   True        False         False      20h     
      machine-config                                           True        True          True       20h     Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
      marketplace                                4.14.0-rc.0   True        False         False      20h     
      monitoring                                 4.14.0-rc.0   True        False         False      20h     
      network                                    4.14.0-rc.0   True        False         False      20h     
      node-tuning                                4.14.0-rc.0   True        False         False      20h     
      openshift-apiserver                        4.14.0-rc.0   True        False         False      114m    
      openshift-controller-manager               4.14.0-rc.0   True        False         False      114m    
      openshift-samples                          4.14.0-rc.0   True        False         False      20h     
      operator-lifecycle-manager                 4.14.0-rc.0   True        False         False      20h     
      operator-lifecycle-manager-catalog         4.14.0-rc.0   True        False         False      20h     
      operator-lifecycle-manager-packageserver   4.14.0-rc.0   True        False         False      20h     
      service-ca                                 4.14.0-rc.0   True        False         False      20h     
      storage                                    4.14.0-rc.0   True        False         False      20h     
      
      # oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master                                                      False     True       True       1              0                   0                     1                      20h
      worker   rendered-worker-b1ae085b0af76cd90252035359d08f1e   True      False      False      0              0                   0                     0                      20h
      
      # oc get mc
      NAME                                                GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
      00-master                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      00-worker                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      01-master-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      01-master-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      01-worker-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      01-worker-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      04-accelerated-container-startup-master                                                        3.2.0             21h
      04-accelerated-container-startup-worker                                                        3.2.0             21h
      05-kdump-config-master                                                                         3.2.0             21h
      05-kdump-config-worker                                                                         3.2.0             21h
      06-kdump-enable-master                                                                         3.2.0             21h
      06-kdump-enable-worker                                                                         3.2.0             21h
      10-masters-node-ip-hint                                                                        3.1.0             21h
      10-workers-node-ip-hint                                                                        3.1.0             21h
      50-master-dnsmasq-configuration                                                                3.1.0             21h
      50-masters-chrony-configuration                                                                3.1.0             21h
      50-workers-chrony-configuration                                                                3.1.0             21h
      97-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      97-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      98-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      98-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      99-crio-disable-wipe-master                                                                    3.2.0             21h
      99-crio-disable-wipe-worker                                                                    3.2.0             21h
      99-master-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      99-master-ssh                                                                                  3.2.0             21h
      99-worker-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      99-worker-ssh                                                                                  3.2.0             21h
      container-mount-namespace-and-kubelet-conf-master                                              3.2.0             21h
      container-mount-namespace-and-kubelet-conf-worker                                              3.2.0             21h
      load-sctp-module-master                                                                        2.2.0             21h
      load-sctp-module-worker                                                                        2.2.0             21h
      rendered-master-048c4c2232a009a3adef2b6b23dff69e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      rendered-master-9db98f74fc6921ae0f62e5a586e34842    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      rendered-worker-4e06f96d0ae77bdf1dfc8a85022a0a0d    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
      rendered-worker-b1ae085b0af76cd90252035359d08f1e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h

      Version-Release number of selected component (if applicable):

      Deployed SNO OCP - 4.14.0-rc.0
      Hub 4.13.11
      ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

      How reproducible:

      Rare: 1 out of 3618 in this test. In a prior test it was 5 out of 3618, so it could be more frequent than observed in this test.
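
For reference, the two observed counts can be converted to percentages as below. The helper name `failure_rate_percent` is made up for this note; the counts are the ones reported above.

```python
def failure_rate_percent(failures: int, total: int) -> float:
    """Return the observed failure rate as a percentage."""
    return 100.0 * failures / total


this_test = failure_rate_percent(1, 3618)
prior_test = failure_rate_percent(5, 3618)

# Both observed rates are well under the "<1%" bound in the description.
print(f"this test:  {this_test:.3f}%")
print(f"prior test: {prior_test:.3f}%")
```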

      Additional info:

      Name:         machine-config
      Namespace:    
      Labels:       <none>
      Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
                    include.release.openshift.io/self-managed-high-availability: true
                    include.release.openshift.io/single-node-developer: true
      API Version:  config.openshift.io/v1
      Kind:         ClusterOperator
      Metadata:
        Creation Timestamp:  2023-09-12T20:42:51Z
        Generation:          1
        Managed Fields:
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                .:
                f:exclude.release.openshift.io/internal-openshift-hosted:
                f:include.release.openshift.io/self-managed-high-availability:
                f:include.release.openshift.io/single-node-developer:
              f:ownerReferences:
                .:
                k:{"uid":"548d677e-f003-4aab-9492-9a3cb807c476"}:
            f:spec:
          Manager:      cluster-version-operator
          Operation:    Update
          Time:         2023-09-12T20:42:51Z
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:status:
          Manager:      cluster-version-operator
          Operation:    Update
          Subresource:  status
          Time:         2023-09-12T20:42:51Z
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:status:
              f:conditions:
              f:extension:
                .:
                f:master:
                f:worker:
              f:relatedObjects:
          Manager:      machine-config-operator
          Operation:    Update
          Subresource:  status
          Time:         2023-09-13T17:53:10Z
        Owner References:
          API Version:     config.openshift.io/v1
          Controller:      true
          Kind:            ClusterVersion
          Name:            version
          UID:             548d677e-f003-4aab-9492-9a3cb807c476
        Resource Version:  369343
        UID:               c037a04b-6bca-4422-9658-e6d939ce648a
      Spec:
      Status:
        Conditions:
          Last Transition Time:  2023-09-12T20:57:31Z
          Message:               Working towards 4.14.0-rc.0
          Status:                True
          Type:                  Progressing
          Last Transition Time:  2023-09-12T21:03:31Z
          Message:               Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
          Reason:                RequiredPoolsFailed
          Status:                True
          Type:                  Degraded
          Last Transition Time:  2023-09-12T21:03:31Z
          Message:               Cluster has deployed []
          Reason:                AsExpected
          Status:                True
          Type:                  Available
          Last Transition Time:  2023-09-12T21:03:54Z
          Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
          Reason:                DegradedPool
          Status:                False
          Type:                  Upgradeable
        Extension:
          Master:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node vm01213 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-f8b0d299858740f4ece5a5dabf600cfd\\\" not found\""
          Worker:  all 0 nodes are at latest configuration rendered-worker-b1ae085b0af76cd90252035359d08f1e
        Related Objects:
          Group:     
          Name:      openshift-machine-config-operator
          Resource:  namespaces
          Group:     machineconfiguration.openshift.io
          Name:      
          Resource:  machineconfigpools
          Group:     machineconfiguration.openshift.io
          Name:      
          Resource:  controllerconfigs
          Group:     machineconfiguration.openshift.io
          Name:      
          Resource:  kubeletconfigs
          Group:     machineconfiguration.openshift.io
          Name:      
          Resource:  containerruntimeconfigs
          Group:     machineconfiguration.openshift.io
          Name:      
          Resource:  machineconfigs
          Group:     
          Name:      
          Resource:  nodes
          Group:     
          Name:      openshift-kni-infra
          Resource:  namespaces
          Group:     
          Name:      openshift-openstack-infra
          Resource:  namespaces
          Group:     
          Name:      openshift-ovirt-infra
          Resource:  namespaces
          Group:     
          Name:      openshift-vsphere-infra
          Resource:  namespaces
          Group:     
          Name:      openshift-nutanix-infra
          Resource:  namespaces
      Events:        <none>
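
The Master extension message above refers to rendered-master-f8b0d299858740f4ece5a5dabf600cfd, which does not appear in the `oc get mc` listing. A small sketch of that cross-check on captured text follows; the function name `missing_rendered_configs` and the regular expression are assumptions made for this example, and the sample strings are trimmed copies of the output above.

```python
import re

# Assumed naming pattern: rendered-<pool>-<32 hex characters>, as seen in the output above.
RENDERED_RE = re.compile(r"rendered-(?:master|worker)-[0-9a-f]{32}")


def missing_rendered_configs(status_text: str, mc_listing: str) -> set:
    """Return rendered configs named in the status text but absent from the listing."""
    wanted = set(RENDERED_RE.findall(status_text))
    present = set(RENDERED_RE.findall(mc_listing))
    return wanted - present


status_text = (
    'Master: Node vm01213 is reporting: "machineconfig.machineconfiguration.openshift.io '
    '"rendered-master-f8b0d299858740f4ece5a5dabf600cfd" not found"\n'
    "Worker: all 0 nodes are at latest configuration "
    "rendered-worker-b1ae085b0af76cd90252035359d08f1e\n"
)
mc_listing = """\
rendered-master-048c4c2232a009a3adef2b6b23dff69e
rendered-master-9db98f74fc6921ae0f62e5a586e34842
rendered-worker-4e06f96d0ae77bdf1dfc8a85022a0a0d
rendered-worker-b1ae085b0af76cd90252035359d08f1e
"""

print(missing_rendered_configs(status_text, mc_listing))
```

On a live cluster, the same two inputs could be taken from `oc describe co machine-config` and `oc get mc`.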

              jerzhang@redhat.com Yu Qi Zhang
              akrzos@redhat.com Alex Krzos
              Sergio Regidor de la Rosa