OpenShift Bugs / OCPBUGS-27826

MCO failed to drain the node due to the custom catalog source pod with no 'controller: true' ownerReferences

      Description of problem:

      The cluster failed to upgrade from 4.12.47 to 4.13.30 because the MCO could not evict pods that lack a 'controller: true' ownerReference.
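
      For context, the drain that the MCO performs can be approximated manually with `oc adm drain`. This is only a sketch (the MCO runs the drain through its own controller and its internal options may differ), but it should surface the same blocking pods:

          # Sketch: approximate the MCO drain of the affected node.
          # Node name is taken from the output below; the per-attempt timeout
          # mirrors the 1m30s seen in the machine-config-controller logs.
          oc adm drain maxu-47686-b7jsh-worker-0 \
            --ignore-daemonsets \
            --delete-emptydir-data \
            --timeout=90s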

          jiazha-mac:~ jiazha$ omg get nodes 
      NAME                       STATUS                    ROLES                 AGE    VERSION
      maxu-47686-b7jsh-rhel-1    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-rhel-0    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-worker-2  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-0  Ready                     control-plane,master  3h14m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-1  Ready                     control-plane,master  3h10m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-2  Ready                     control-plane,master  2h56m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-1  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-0  Ready,SchedulingDisabled  worker                2h51m  v1.25.16+5c97f5b
      jiazha-mac:~ jiazha$ omg get mcp
      NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
      worker  rendered-worker-f88335e9afd20564c05e8b0cd4573df2  False    True      True      5             0                  0                    1                     3h10m
      master  rendered-master-1023ac264533c5f1926448ec0a816c28  True     False     False     3             3                  3                    0                     3h10m
      
      jiazha-mac:~ jiazha$ omg get co machine-config -o yaml
      apiVersion: config.openshift.io/v1
      kind: ClusterOperator
      ...
        extension:
          master: all 3 nodes are at latest configuration rendered-master-1023ac264533c5f1926448ec0a816c28
          worker: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
            status on sync": "Node maxu-47686-b7jsh-worker-0 is reporting: \"failed to drain
            node: maxu-47686-b7jsh-worker-0 after 1 hour. Please see machine-config-controller
            logs for more information\""'
        relatedObjects:
      
      jiazha-mac:~ jiazha$ omg -n openshift-machine-config-operator logs machine-config-controller-74b57df9d6-2gfmp -c machine-config-controller |grep "Drain failed"
      2024-01-23T11:07:39.778225664Z I0123 11:07:39.778190       1 drain_controller.go:139] node maxu-47686-b7jsh-worker-0: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "hello-pod" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, error when waiting for pod "ocp-54745-pod-0" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]
      ...
      2024-01-23T12:16:39.381322587Z I0123 12:16:39.380041       1 drain_controller.go:139] node maxu-47686-b7jsh-worker-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: [error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]

      In the end, only the two `qe-app-registry-xxx` pods failed to be evicted, as shown below:

      jiazha-mac:~ jiazha$ omg get pods -o wide -n openshift-marketplace 
      NAME                                   READY  STATUS   RESTARTS  AGE    IP           NODE
      marketplace-operator-845b865dbd-qhd6d  0/1    Running  0         1h13m  10.129.0.19  maxu-47686-b7jsh-master-1
      qe-app-registry-6cmvf                  0/1    Pending  0         2h33m               maxu-47686-b7jsh-worker-0
      qe-app-registry-cwsvf                  0/1    Pending  0         2h17m               maxu-47686-b7jsh-worker-0
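
      One way to cross-check which pods on the node lack a 'controller: true' ownerReference (a sketch, assuming `jq` is available locally; live API objects return the flag as a boolean):

          # List pods on the drained node whose ownerReferences contain no controller reference
          oc get pods -A --field-selector spec.nodeName=maxu-47686-b7jsh-worker-0 -o json \
            | jq -r '.items[]
                | select(([.metadata.ownerReferences[]? | select(.controller == true)] | length) == 0)
                | "\(.metadata.namespace)/\(.metadata.name)"'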
       
      jiazha-mac:~ jiazha$ omg get pods qe-app-registry-cwsvf -o yaml
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
          k8s.v1.cni.cncf.io/network-status: "[{\n    \"name\": \"openshift-sdn\",\n   \
            \ \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.16\"\n    ],\n\
            \    \"default\": true,\n    \"dns\": {}\n}]"
          k8s.v1.cni.cncf.io/networks-status: "[{\n    \"name\": \"openshift-sdn\",\n  \
            \  \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.16\"\n    ],\n\
            \    \"default\": true,\n    \"dns\": {}\n}]"
          kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"qe-app-registry","namespace":"openshift-marketplace"},"spec":{"displayName":"Production
            Operators","image":"upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12","publisher":"OpenShift
            QE","sourceType":"grpc","updateStrategy":{"registryPoll":{"interval":"15m"}}}}
      
      
            '
          openshift.io/scc: anyuid
        creationTimestamp: '2024-01-23T10:00:53Z'
        deletionGracePeriodSeconds: '30'
        deletionTimestamp: '2024-01-23T10:23:01Z'
        generateName: qe-app-registry-
        labels:
          catalogsource.operators.coreos.com/update: qe-app-registry
          olm.catalogSource: ''
          olm.pod-spec-hash: 9b66974d5
        managedFields:
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                .: {}
                f:cluster-autoscaler.kubernetes.io/safe-to-evict: {}
                f:kubectl.kubernetes.io/last-applied-configuration: {}
              f:generateName: {}
              f:labels:
                .: {}
                f:catalogsource.operators.coreos.com/update: {}
                f:olm.catalogSource: {}
                f:olm.pod-spec-hash: {}
              f:ownerReferences:
                .: {}
                k:{"uid":"632372ee-c42f-40b8-9da9-dc57097cf4ec"}: {}
            f:spec:
              f:containers:
                k:{"name":"registry-server"}:
                  .: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:livenessProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:initialDelaySeconds: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:name: {}
                  f:ports:
                    .: {}
                    k:{"containerPort":50051,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:name: {}
                      f:protocol: {}
                  f:readinessProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:initialDelaySeconds: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:resources:
                    .: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    .: {}
                    f:readOnlyRootFilesystem: {}
                  f:startupProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
              f:dnsPolicy: {}
              f:enableServiceLinks: {}
              f:nodeSelector: {}
              f:restartPolicy: {}
              f:schedulerName: {}
              f:securityContext: {}
              f:serviceAccount: {}
              f:serviceAccountName: {}
              f:terminationGracePeriodSeconds: {}
          manager: catalog
          operation: Update
          time: '2024-01-23T10:00:53Z'
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:status:
              f:conditions:
                k:{"type":"ContainersReady"}:
                  .: {}
                  f:lastProbeTime: {}
                  f:lastTransitionTime: {}
                  f:message: {}
                  f:reason: {}
                  f:status: {}
                  f:type: {}
                k:{"type":"Initialized"}:
                  .: {}
                  f:lastProbeTime: {}
                  f:lastTransitionTime: {}
                  f:status: {}
                  f:type: {}
                k:{"type":"Ready"}:
                  .: {}
                  f:lastProbeTime: {}
                  f:lastTransitionTime: {}
                  f:message: {}
                  f:reason: {}
                  f:status: {}
                  f:type: {}
              f:containerStatuses: {}
              f:hostIP: {}
              f:startTime: {}
          manager: kubelet
          operation: Update
          subresource: status
          time: '2024-01-23T10:00:53Z'
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                f:k8s.v1.cni.cncf.io/network-status: {}
                f:k8s.v1.cni.cncf.io/networks-status: {}
          manager: multus
          operation: Update
          subresource: status
          time: '2024-01-23T10:00:55Z'
        name: qe-app-registry-cwsvf
        namespace: openshift-marketplace
        ownerReferences:
        - apiVersion: operators.coreos.com/v1alpha1
          blockOwnerDeletion: 'false'
          controller: 'false'
          kind: CatalogSource
          name: qe-app-registry
          uid: 632372ee-c42f-40b8-9da9-dc57097cf4ec
        resourceVersion: '55526'
        uid: de223db3-6a3c-42f8-9c58-e80c8b9837de
      spec:
        containers:
        - image: upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12
          imagePullPolicy: Always
          livenessProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: '3'
            initialDelaySeconds: '10'
            periodSeconds: '10'
            successThreshold: '1'
            timeoutSeconds: '5'
          name: registry-server
          ports:
          - containerPort: '50051'
            name: grpc
            protocol: TCP
          readinessProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: '3'
            initialDelaySeconds: '5'
            periodSeconds: '10'
            successThreshold: '1'
            timeoutSeconds: '5'
          resources:
            requests:
              cpu: 10m
              memory: 50Mi
          securityContext:
            capabilities:
              drop:
              - MKNOD
            readOnlyRootFilesystem: 'false'
          startupProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: '10'
            periodSeconds: '10'
            successThreshold: '1'
            timeoutSeconds: '5'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          volumeMounts:
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-d8vvx
            readOnly: 'true'
        dnsPolicy: ClusterFirst
        enableServiceLinks: 'true'
        imagePullSecrets:
        - name: qe-app-registry-dockercfg-nkr5j
        nodeName: maxu-47686-b7jsh-worker-0
        nodeSelector:
          kubernetes.io/os: linux
        preemptionPolicy: PreemptLowerPriority
        priority: '0'
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext:
          seLinuxOptions:
            level: s0:c16,c5
        serviceAccount: qe-app-registry
        serviceAccountName: qe-app-registry
        terminationGracePeriodSeconds: '30'
        tolerations:
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: '300'
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: '300'
        - effect: NoSchedule
          key: node.kubernetes.io/memory-pressure
          operator: Exists
        volumes:
        - name: kube-api-access-d8vvx
          projected:
            defaultMode: '420'
            sources:
            - serviceAccountToken:
                expirationSeconds: '3607'
                path: token
            - configMap:
                items:
                - key: ca.crt
                  path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
            - configMap:
                items:
                - key: service-ca.crt
                  path: service-ca.crt
                name: openshift-service-ca.crt
      status:
        conditions:
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          status: 'True'
          type: Initialized
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          message: 'containers with unready status: [registry-server]'
          reason: ContainersNotReady
          status: 'False'
          type: Ready
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          message: 'containers with unready status: [registry-server]'
          reason: ContainersNotReady
          status: 'False'
          type: ContainersReady
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          status: 'True'
          type: PodScheduled
        containerStatuses:
        - image: upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12
          imageID: ''
          lastState: {}
          name: registry-server
          ready: 'false'
          restartCount: '0'
          started: 'false'
          state:
            waiting:
              reason: ContainerCreating
        hostIP: 192.168.0.18
        phase: Pending
        qosClass: Burstable
        startTime: '2024-01-23T10:00:53Z'
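
      Note that the pod's only ownerReference (the qe-app-registry CatalogSource) is not a controller reference, which is what the summary points to. A quick way to confirm just that field, assuming the pod still exists when queried (sketch):

          # Print each ownerReference's kind/name and controller flag for the stuck pod
          oc -n openshift-marketplace get pod qe-app-registry-cwsvf \
            -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name} controller={.controller}{"\n"}{end}'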
      
      

      Version-Release number of selected component (if applicable):

          4.12.47

      How reproducible:

          often

      Steps to Reproduce:

      The failure can be reproduced by triggering this Jenkins job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/47686/consoleFull

          1. Build a 4.12.47 cluster.
          2. Upgrade it to 4.13.30 (see the sketch after this list).
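
      A minimal sketch of the upgrade step; the Jenkins pipeline may instead pin a specific payload image with --to-image, and the channel name here is an assumption:

          # Sketch of step 2
          oc adm upgrade channel stable-4.13
          oc adm upgrade --to=4.13.30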
          

      Actual results:

      The upgrade failed; the worker pool is degraded because maxu-47686-b7jsh-worker-0 cannot be drained:

      jiazha-mac:~ jiazha$ omg get pods -o wide 
      NAME                                   READY  STATUS   RESTARTS  AGE    IP           NODE
      marketplace-operator-845b865dbd-qhd6d  0/1    Running  0         1h13m  10.129.0.19  maxu-47686-b7jsh-master-1
      qe-app-registry-6cmvf                  0/1    Pending  0         2h33m               maxu-47686-b7jsh-worker-0
      qe-app-registry-cwsvf                  0/1    Pending  0         2h17m               maxu-47686-b7jsh-worker-0
      jiazha-mac:~ jiazha$ omg get nodes 
      NAME                       STATUS                    ROLES                 AGE    VERSION
      maxu-47686-b7jsh-rhel-1    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-rhel-0    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-worker-2  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-0  Ready                     control-plane,master  3h14m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-1  Ready                     control-plane,master  3h10m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-2  Ready                     control-plane,master  2h56m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-1  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-0  Ready,SchedulingDisabled  worker                2h51m  v1.25.16+5c97f5b
      jiazha-mac:~ jiazha$ 
      jiazha-mac:~ jiazha$ omg get mcp 
      NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
      worker  rendered-worker-f88335e9afd20564c05e8b0cd4573df2  False    True      True      5             0                  0                    1                     3h10m
      master  rendered-master-1023ac264533c5f1926448ec0a816c28  True     False     False     3             3                  3                    0                     3h10m
      jiazha-mac:~ jiazha$ omg get co machine-config
      NAME            VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
      machine-config  4.12.47  True       False        True      3h7m    
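
      The degraded reason can also be read directly from the worker MachineConfigPool conditions (a sketch, assuming the standard NodeDegraded condition type):

          # Print the NodeDegraded condition message from the worker pool
          oc get mcp worker \
            -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'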

      Expected results:

          The cluster upgrades to 4.13.30 successfully.

      Additional info:

      The must-gather logs: https://drive.google.com/file/d/1BRJPwc8YAtVh0x6PD4wyB5TPzl7qdVHS/view?usp=drive_link
